The rapid development of vision language models (VLMs) demands rigorous and reliable evaluation. However, current visual question answering (VQA) benchmarks often rely on open-ended questions, making accurate evaluation difficult due to the variability of natural language responses. To address this, we introduce AutoConverter, an agentic framework that automatically converts open-ended questions into multiple-choice format, enabling objective evaluation while reducing the cost of question creation. Our experiments demonstrate that AutoConverter generates correct and challenging multiple-choice questions: VLMs consistently achieve similar or lower accuracy on them than on human-created ones. Using AutoConverter, we construct VMCBench, a benchmark created by transforming 20 existing VQA datasets into a unified multiple-choice format, totaling 9,018 questions. We comprehensively evaluate 28 state-of-the-art VLMs on VMCBench, setting a new standard for scalable, consistent, and reproducible VLM evaluation.
Our paper can be summarized into three main points:
1. Evaluating VLMs with open-ended VQA questions is unreliable, because measuring the semantic similarity between free-form answers and ground truth is difficult.
2. AutoConverter, our agentic framework, automatically converts open-ended questions into correct and challenging multiple-choice questions.
3. With AutoConverter, we build VMCBench, a unified multiple-choice benchmark of 9,018 questions drawn from 20 VQA datasets, and evaluate 28 state-of-the-art VLMs on it.
We will delve into each of these points in the following sections.
In this section, we discuss the challenges of evaluating vision language models (VLMs) with open-ended questions. The primary issue is accurately and robustly measuring the semantic similarity between model-generated answers and ground-truth answers, a long-standing challenge in natural language processing.
We find that: (Left) rule-based metrics significantly underestimate model performance and penalize models that do not strictly follow the expected format. (Right) model-based evaluations using two different versions of GPT yield substantially different scores, making comparisons inconsistent and raising reproducibility issues.
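To make the rule-based failure mode concrete, here is a minimal sketch (not the evaluation code used in the paper) of how an exact-match metric under-credits answers that are semantically correct but phrased differently:

```python
# Minimal sketch: why rule-based scoring of open-ended answers is brittle.
# An exact-match rule marks a semantically correct answer wrong whenever its
# surface form differs from the reference string.

def exact_match(prediction: str, reference: str) -> bool:
    """Rule-based metric: normalized string equality."""
    normalize = lambda s: s.strip().lower().rstrip(".")
    return normalize(prediction) == normalize(reference)

reference = "two"
predictions = ["two", "Two.", "There are two dogs.", "2"]
for pred in predictions:
    print(f"{pred!r:24} -> {exact_match(pred, reference)}")
# Only the first two predictions are credited, even though all four answer the
# question correctly. A model-based judge would accept them, but its verdicts
# shift with the judge model and version, hurting reproducibility.
```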
Given the challenge of evaluating open-ended questions for vision language models (VLMs) detailed in the previous section, how can we mitigate these issues? We propose to convert open-ended questions into a multiple-choice format, capitalizing on the simplicity and objectivity of evaluating multiple-choice questions. However, traditionally, creating multiple-choice questions, especially reasonable yet challenging distractor options, requires substantial human expertise and effort. In this section, we introduce AutoConverter, an agentic pipeline that automatically generates high-quality multiple-choice questions from open-ended ones.
(Left) AutoConverter is a multi-agent framework with two key steps: increasing difficulty and ensuring the correctness of the converted question. (Right) We perform an ablation study on AutoConverter and find that each component is crucial for enhancing question correctness and achieving the desired level of difficulty.
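As a rough illustration of the propose-then-review idea behind such an agentic pipeline (this is not the authors' implementation), the sketch below assumes a hypothetical `llm(prompt)` helper that returns a text completion from any capable language model:

```python
# Sketch of an AutoConverter-style conversion loop (illustrative, not the
# paper's code). `llm(prompt)` is a hypothetical text-completion helper.

def convert_to_multiple_choice(question: str, answer: str, llm, n_rounds: int = 3):
    # Step 1: increase difficulty -- propose distractors that simulate errors
    # from different perspectives (conceptual, reasoning, visual, ...).
    distractors = llm(
        f"Question: {question}\nCorrect answer: {answer}\n"
        "Propose 3 plausible but wrong options, each reflecting a different "
        "kind of error. Return one option per line."
    ).splitlines()

    # Step 2: ensure correctness -- a reviewer agent checks that no distractor
    # is also a valid answer, and a refiner repairs any flagged option.
    for _ in range(n_rounds):
        verdict = llm(
            f"Question: {question}\nCorrect answer: {answer}\n"
            f"Candidate distractors: {distractors}\n"
            "Is any distractor arguably correct? Answer 'ok' or name the bad one."
        )
        if verdict.strip().lower() == "ok":
            break
        distractors = llm(
            f"Question: {question}\nCorrect answer: {answer}\n"
            f"Current options: {distractors}\nReviewer feedback: {verdict}\n"
            "Revise the flagged option so it stays plausible but is "
            "unambiguously wrong. Return 3 options, one per line."
        ).splitlines()

    return [answer] + distractors[:3]
```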
Using AutoConverter, we generated distractors for questions and answers from three existing multiple-choice datasets: MMMU, MathVista, and AI2D, and compared them with original human-created distractors. We evaluated various VLMs on both the AutoConverter-generated and the original questions, finding that VLMs consistently achieved similar or even lower accuracy on the AutoConverter-generated questions compared to the original ones.
AutoConverter simulates errors from different perspectives and produces correct and challenging multiple-choice questions.
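To make the accuracy comparison above concrete, the following sketch (illustrative only) shows why multiple-choice grading is objective: scoring reduces to extracting an option letter and comparing it against the answer key, with no semantic-similarity judgment involved.

```python
# Minimal sketch of objective multiple-choice grading.
import re

def extract_choice(response: str) -> str | None:
    """Pull the first standalone option letter (A-D) from a model response."""
    match = re.search(r"\b([A-D])\b", response.strip().upper())
    return match.group(1) if match else None

def accuracy(responses: list[str], keys: list[str]) -> float:
    correct = sum(extract_choice(r) == k for r, k in zip(responses, keys))
    return correct / len(keys)

print(accuracy(["The answer is B.", "C", "(A) because ..."], ["B", "C", "D"]))
# -> 0.666..., and the score is identical no matter which grader runs it.
```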
Using AutoConverter, described in the previous section, we introduce VMCBench, a benchmark that unifies 20 existing visual question answering (VQA) datasets into a consistent multiple-choice format. VMCBench spans a diverse array of visual and linguistic contexts, rigorously testing various model capabilities. By transforming open-ended questions into multiple-choice format, VMCBench mitigates the ambiguity of free-form evaluation while preserving the complexity of the tasks. This benchmark provides a reliable and reusable resource for evaluating future vision language models.
(Left) VMCBench is constructed by converting 12 open-ended (OE) and refining 8 multiple-choice (MC) VQA datasets into a unified multiple-choice format, with human validation ensuring correctness. The number of questions per dataset is listed. (Right) Example questions from VMCBench, showcasing diverse question types across multiple domains.
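As an illustration of what a unified multiple-choice record might look like, the sketch below uses an assumed schema; the field names and example values are hypothetical and do not reflect the released VMCBench format.

```python
# Hypothetical unified record for a multiple-choice VQA benchmark (assumed
# schema, for illustration only).
from dataclasses import dataclass

@dataclass
class MCQuestion:
    dataset: str              # source dataset, e.g. "MMMU" or "TextVQA"
    image_path: str           # path or URL of the associated image
    question: str             # question text
    choices: dict[str, str]   # option letter -> option text
    answer: str               # correct option letter

example = MCQuestion(
    dataset="TextVQA",
    image_path="images/000123.jpg",
    question="What brand name is printed on the bottle?",
    choices={"A": "Pepsi", "B": "Coca-Cola", "C": "Sprite", "D": "Fanta"},
    answer="B",
)
print(example.choices[example.answer])  # -> "Coca-Cola"
```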
Leaderboard: results of the 28 evaluated VLMs on VMCBench, with per-model columns for Model, Source, Date, Prediction, Overall, General, Reasoning, OCR, and Doc & Chart accuracy.
@article{AutoConverter,
title={Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation},
author={Yuhui Zhang and Yuchang Su and Yiming Liu and Xiaohan Wang and James Burgess and Elaine Sui and Chenyu Wang and Josiah Aklilu and Alejandro Lozano and Anjiang Wei and Ludwig Schmidt and Serena Yeung-Levy},
journal={arXiv preprint arXiv:2501.03225},
year={2025}
}