Negation is a fundamental linguistic phenomenon that can entirely reverse the meaning of a sentence. As vision language models (VLMs) continue to advance and are deployed in high-stakes applications, assessing their ability to comprehend negation becomes essential. To address this, we introduce NegVQA, a visual question answering (VQA) benchmark consisting of 7,379 two-choice questions covering diverse negation scenarios and image-question distributions. We construct NegVQA by leveraging large language models to generate negated versions of questions from existing VQA datasets. Evaluating 20 state-of-the-art VLMs across seven model families, we find that these models struggle significantly with negation, exhibiting a substantial performance drop compared to their responses to the original questions. Furthermore, we uncover a U-shaped scaling trend, where increasing model size initially degrades performance on NegVQA before leading to improvements. Our benchmark reveals critical gaps in VLMs' negation understanding and offers insights into future VLM development.
Our work addresses three fundamental questions about negation understanding in vision language models: how to construct a benchmark that probes negation in VQA, how well current VLMs answer negated questions, and how negation understanding changes as models scale.
NegVQA is constructed by systematically transforming questions from VMCBench into negated versions using GPT-4o. Our dataset covers diverse negation scenarios across multiple domains.
We employ a two-step process to create high-quality negated questions.
Manual verification of 100 sampled questions shows that 97% are correctly negated, confirming the reliability of our approach.
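To make the generation step concrete, here is a minimal sketch of the GPT-4o negation call using the OpenAI Python SDK. The prompt wording, the `negate_question` helper, and the example question are illustrative assumptions, not the exact prompt used to build NegVQA.

```python
# Minimal sketch of the negation-generation step (illustrative prompt,
# not the exact one used to build NegVQA).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Rewrite the following visual question so that it asks about the negated "
    "condition, keeping it answerable from the same image. "
    "Return only the rewritten question.\n\nQuestion: {question}"
)

def negate_question(question: str) -> str:
    """Ask GPT-4o for a negated version of a VQA question."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": PROMPT.format(question=question)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

# For a two-choice question, a correct negation flips which answer is right.
print(negate_question("Is there a dog in the image?"))
```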
NegVQA covers diverse negation types including object absence, attribute negation, spatial relationships, and complex reasoning scenarios across multiple domains.
Our evaluation of 20 state-of-the-art VLMs reveals significant challenges in negation understanding:
Key Finding: All VLMs show substantial performance drops on negated questions. The best-performing model, Qwen2-VL-72B, achieves 92.2% accuracy on original questions but drops to 72.7% on NegVQA—a gap of 19.5 percentage points.
Surprising Discovery: Model scaling exhibits a U-shaped trend where performance on negated questions initially degrades with increased model size before improving at the highest scales. This pattern is most pronounced in reasoning and document/chart tasks.
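One simple way to quantify such a trend is to fit accuracy as a quadratic in log model size and check the curvature. The sketch below does this with NumPy; the (size, accuracy) pairs are made-up values for illustration, not numbers from our evaluation.

```python
# Hypothetical check for a U-shaped scaling trend: fit accuracy as a
# quadratic in log10(parameter count) and test for an interior minimum.
import numpy as np

sizes = np.array([2e9, 7e9, 34e9, 72e9])   # parameter counts (illustrative)
accs = np.array([63.0, 57.0, 60.0, 73.0])  # NegVQA accuracy, % (illustrative)

x = np.log10(sizes)
c2, c1, c0 = np.polyfit(x, accs, deg=2)    # highest-degree coefficient first
vertex = -c1 / (2 * c2)                    # minimum of the fitted parabola
u_shaped = c2 > 0 and x.min() < vertex < x.max()
print(f"curvature={c2:.2f}, vertex at 10^{vertex:.1f} params, U-shaped={u_shaped}")
```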
Human evaluation on 100 questions yields 89% accuracy, far above the 72.7% that Qwen2-VL-72B achieves on NegVQA. This gap of 16.3 percentage points highlights the substantial room for improvement in VLMs' negation understanding capabilities.
Accuracy (%) on the original questions and on NegVQA (Drop = Original − NegVQA); the four rightmost columns report NegVQA accuracy by question category.

| Model | Original | NegVQA | Drop | General | Reasoning | OCR | Doc & Chart |
|---|---|---|---|---|---|---|---|
| Qwen2-VL-72B | 92.2 | 72.7 | 19.5 | 71.7 | 64.1 | 91.8 | 72.4 |
| Molmo-72B | 87.5 | 74.5 | 13.0 | 74.8 | 64.7 | 93.9 | 72.1 |
| VILA1.5-40B | 85.7 | 70.5 | 15.2 | 73.2 | 63.0 | 90.3 | 61.8 |
| Cambrian-34B | 87.4 | 59.9 | 27.5 | 64.6 | 53.2 | 81.5 | 49.5 |
| Qwen2-VL-7B | 88.8 | 57.2 | 31.6 | 58.8 | 51.8 | 82.0 | 53.0 |

Human performance (100-question sample): 89%.
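For reference, the headline numbers above reduce to a simple paired-accuracy computation. The sketch below assumes a hypothetical per-question record format with `orig_correct`/`neg_correct` flags; it is not the official evaluation harness.

```python
# Paired accuracy on original vs. negated questions, and the drop in
# percentage points. The record format is a hypothetical illustration.

def accuracy_drop(records: list[dict]) -> tuple[float, float, float]:
    """Return (original accuracy, NegVQA accuracy, drop), all in %."""
    n = len(records)
    orig = 100 * sum(r["orig_correct"] for r in records) / n
    neg = 100 * sum(r["neg_correct"] for r in records) / n
    return orig, neg, orig - neg

records = [
    {"orig_correct": True, "neg_correct": False},
    {"orig_correct": True, "neg_correct": True},
]
print(accuracy_drop(records))  # (100.0, 50.0, 50.0)
```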
@inproceedings{NegVQA,
  title={NegVQA: Can Vision Language Models Understand Negation?},
  author={Yuhui Zhang and Yuchang Su and Yiming Liu and Serena Yeung-Levy},
  booktitle={Findings of the Association for Computational Linguistics: ACL 2025},
  year={2025}
}