NegVQA: Can Vision Language Models Understand Negation?

ACL 2025 Findings

¹Stanford University ²Tsinghua University

Abstract

Negation is a fundamental linguistic phenomenon that can entirely reverse the meaning of a sentence. As vision language models (VLMs) continue to advance and are deployed in high-stakes applications, assessing their ability to comprehend negation becomes essential. To address this, we introduce NegVQA, a visual question answering (VQA) benchmark consisting of 7,379 two-choice questions covering diverse negation scenarios and image-question distributions. We construct NegVQA by leveraging large language models to generate negated versions of questions from existing VQA datasets. Evaluating 20 state-of-the-art VLMs across seven model families, we find that these models struggle significantly with negation, exhibiting a substantial performance drop compared to their responses to the original questions. Furthermore, we uncover a U-shaped scaling trend, where increasing model size initially degrades performance on NegVQA before leading to improvements. Our benchmark reveals critical gaps in VLMs' negation understanding and offers insights into future VLM development.


🧠 Understanding Negation in VLMs

Our work addresses three fundamental questions about negation understanding in vision language models:

  • How well do current VLMs understand negation? We systematically evaluate 20 state-of-the-art VLMs and find they struggle significantly with negated questions.
  • What types of negation are most challenging? We analyze performance across different domains including general VQA, reasoning, OCR, and document/chart understanding.
  • How does model scaling affect negation understanding? We discover a surprising U-shaped scaling trend where larger models initially perform worse before eventually improving.


📊 NegVQA Dataset

NegVQA is constructed by systematically transforming questions from VMCBench into negated versions using GPT-4o. Our dataset covers diverse negation scenarios across multiple domains.

[Figure: NegVQA examples]

Dataset Construction Process

We employ a two-step process to create high-quality negated questions (a minimal code sketch follows the list):

  1. Question Negation: We prompt GPT-4o to generate a negated version of each question while preserving its syntactic structure, so that the negation is the only change in meaning. For example, "Who wrote this book?" becomes "Who did not write this book?"
  2. Answer Adjustment: We modify the answer choices to reflect the negation, ensuring that the negation meaningfully changes which choice is correct.
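
The sketch below illustrates what such a pipeline can look like in Python, assuming the `openai` client library. The prompt wording, the `negate_question` helper, and the answer-swapping heuristic in `adjust_answers` are illustrative assumptions, not our released code or exact prompts.

```python
# Minimal sketch of a NegVQA-style construction pipeline (illustrative only).
# Assumes the `openai` package and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

# Illustrative prompt; the actual prompt used to build NegVQA may differ.
NEGATION_PROMPT = (
    "Rewrite the following visual question in its negated form, keeping the "
    "original syntactic structure so that the negation is the only change in "
    "meaning. Return only the rewritten question.\n\nQuestion: {question}"
)

def negate_question(question: str) -> str:
    """Step 1: ask GPT-4o for a negated version of the question."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": NEGATION_PROMPT.format(question=question)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

def adjust_answers(correct: str, distractor: str) -> tuple[str, str]:
    """Step 2 (one plausible heuristic): under negation, the originally
    correct choice becomes the distractor and vice versa, so the negation
    actually flips which of the two choices is right."""
    return distractor, correct  # (new_correct, new_distractor)

# Example from the text: "Who wrote this book?" -> "Who did not write this book?"
negated = negate_question("Who wrote this book?")
new_correct, new_distractor = adjust_answers("J.K. Rowling", "Mark Twain")
print(negated, new_correct, new_distractor)
```

Using temperature 0 keeps the rewrites deterministic, which makes spot-checking the pipeline (as in the manual verification below) easier to reproduce.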

Our manual verification of 100 sampled questions shows that 97% are negated correctly, confirming the reliability of our approach.


Example Questions from NegVQA

NegVQA covers diverse negation types including object absence, attribute negation, spatial relationships, and complex reasoning scenarios across multiple domains.



📈 Key Results

Our evaluation of 20 state-of-the-art VLMs reveals significant challenges in negation understanding:


VLM Performance on NegVQA

[Figure: Performance drop on negated questions]

Key Finding: All 20 VLMs show substantial performance drops on negated questions. Qwen2-VL-72B, the strongest model on the original questions, achieves 92.2% accuracy on them but only 72.7% on NegVQA, a drop of 19.5 percentage points.
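
For reference, the drop numbers reported here and in the leaderboard below are simple paired accuracy differences: each NegVQA question is derived from an original question, so a model is scored on both versions. The sketch below shows this computation, assuming a `predict(image, question, choices)` callable and record field names that are illustrative rather than taken from our evaluation harness.

```python
# Minimal sketch of the paired evaluation (illustrative only).
from typing import Callable

def paired_accuracy(
    records: list[dict],
    predict: Callable[[str, str, list[str]], str],
) -> tuple[float, float, float]:
    """Return (original accuracy, NegVQA accuracy, drop) in percent.

    Each record is assumed to pair an original two-choice question with its
    negated counterpart: an image path, both question texts, both choice
    lists, and the correct choice for each version.
    """
    orig_hits = neg_hits = 0
    for r in records:
        orig_hits += predict(r["image"], r["question"], r["choices"]) == r["answer"]
        neg_hits += predict(r["image"], r["neg_question"], r["neg_choices"]) == r["neg_answer"]
    n = len(records)
    orig_acc = 100 * orig_hits / n
    neg_acc = 100 * neg_hits / n
    return orig_acc, neg_acc, orig_acc - neg_acc  # e.g. 92.2, 72.7, 19.5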


U-Shaped Scaling Trend

Surprising Discovery: Model scaling exhibits a U-shaped trend: performance on negated questions initially degrades as model size increases before improving at the largest scales. This pattern is most pronounced on reasoning and document/chart tasks.


Human vs. Model Performance

Human evaluation on 100 questions shows 89% accuracy, well above Qwen2-VL-72B's 72.7%. This gap of more than 16 percentage points highlights the substantial room for improvement in VLMs' negation understanding.


Performance Leaderboard

All values are accuracy (%); Drop = Original - NegVQA. The four rightmost columns break down NegVQA accuracy by domain.

| Model | Original | NegVQA | Drop | General | Reasoning | OCR | Doc & Chart |
|---|---:|---:|---:|---:|---:|---:|---:|
| Qwen2-VL-72B | 92.2 | 72.7 | 19.5 | 71.7 | 64.1 | 91.8 | 72.4 |
| Molmo-72B | 87.5 | 74.5 | 13.0 | 74.8 | 64.7 | 93.9 | 72.1 |
| VILA1.5-40B | 85.7 | 70.5 | 15.2 | 73.2 | 63.0 | 90.3 | 61.8 |
| Cambrian-34B | 87.4 | 59.9 | 27.5 | 64.6 | 53.2 | 81.5 | 49.5 |
| Qwen2-VL-7B | 88.8 | 57.2 | 31.6 | 58.8 | 51.8 | 82.0 | 53.0 |

Human Performance: 89%

BibTeX

@inproceedings{NegVQA,
  title={NegVQA: Can Vision Language Models Understand Negation?},
  author={Yuhui Zhang and Yuchang Su and Yiming Liu and Serena Yeung-Levy},
  booktitle={ACL 2025 Findings},
  year={2025}
}