Why are Visually-Grounded Language Models
Bad at Image Classification?

Yuhui Zhang, Alyssa Unell, Xiaohan Wang, Dhruba Ghosh, Yuchang Su, Ludwig Schmidt, Serena Yeung-Levy

Stanford University, University of Washington, Tsinghua University

Abstract

Image classification is one of the most fundamental capabilities of machine vision intelligence. In this work, we revisit the image classification task using visually-grounded language models (VLMs) such as GPT-4V and LLaVA. We find that existing proprietary and public VLMs, despite often using CLIP as a vision encoder and having many more parameters, significantly underperform CLIP on standard image classification benchmarks like ImageNet. To understand the reason, we explore several hypotheses concerning the inference algorithms, training objectives, and data processing in VLMs. Our analysis reveals that the primary cause is data-related: critical information for image classification is encoded in the VLM's latent space but can only be effectively decoded with enough training data. Specifically, there is a strong correlation between the frequency of class exposure during VLM training and instruction-tuning and the VLM's performance on those classes; when trained with sufficient data, VLMs can match the accuracy of state-of-the-art classification models. Based on these findings, we enhance a VLM by integrating classification-focused datasets into its training, and demonstrate that the enhanced classification performance of the VLM transfers to its general capabilities, resulting in an improvement of 11.8% on the newly collected ImageWikiQA dataset.


💡 Paper Overview

Our paper can be summarized into three main points:

  • (Left) Different visually-grounded language models (VLMs) underperform CLIP in classification by a large margin, though they often use CLIP as a vision encoder.
  • (Middle) We investigate several hypotheses about why VLMs are bad classifiers and find that the main reason is data. Critical information for image classification is encoded in the VLM's latent space but can only be decoded with enough data during VLM training.
  • (Right) Based on our analysis, we improve a VLM by integrating classification data into its training, and find that the improved classification capabilities serve as a foundation for more advanced capabilities such as visual question answering.

We will delve into each of these points in the following sections.



📈 VLMs are Bad at Image Classification

We evaluate VLMs and CLIP models on standard image classification benchmarks: ImageNet, Flowers102, StanfordCars, and Caltech101. We perform image classification in two settings: an open-world setting, where the label set is not provided, and a closed-world setting, where all candidate classes are concatenated into the prompt.
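
To make the two settings concrete, below is a minimal Python sketch of how the prompts can be constructed; the wording is illustrative and not necessarily the paper's verbatim prompts.

# A minimal sketch of the two evaluation settings as prompts.
# The exact wording is illustrative, not the paper's verbatim prompts.

def open_world_prompt() -> str:
    # Open-world: the model must name the class without seeing the label set.
    return "What type of object is in this photo? Answer with the object's name only."

def closed_world_prompt(class_names: list[str]) -> str:
    # Closed-world: all candidate classes are concatenated into the prompt.
    options = ", ".join(class_names)
    return (
        "What type of object is in this photo? "
        f"Choose one option from the following list: {options}. "
        "Answer with the class name only."
    )

if __name__ == "__main__":
    print(open_world_prompt())
    print(closed_world_prompt(["golden retriever", "tabby cat", "red fox"]))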

We find that VLMs exhibit poor performance in image classification, significantly lagging behind CLIP models.



🔮 Why are VLMs Bad Image Classifiers?

Given that VLMs underperform CLIP models at classification by a large margin, we seek to understand the reasons behind this gap. We investigate several hypotheses concerning the major differences between VLMs and CLIP models, which can be broadly categorized into inference, training, and data.


Inference

We start with inference-related questions. For example, does prompt variation, such as chain-of-thought prompting, affect the final performance? Does reducing the label set size in context narrow the gap between VLMs and CLIP models? Does performing probabilistic inference to force the generation into the label set help? We find that none of these factors fully closes the gap between VLMs and CLIP models.

(Top) We explore prompt variations such as wording, label order, and chain-of-thought prompting, and find that they have limited impact on performance. (Bottom) We apply a probabilistic inference strategy, which improves performance but still fails to close the gap between VLMs and CLIP models.
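
As a rough illustration, the sketch below implements the scoring step behind such a probabilistic inference strategy: each candidate label is scored by the log-likelihood the model assigns to its tokens, and the argmax is taken as the prediction. The random logits and hand-written token ids are stand-ins; in practice the logits come from the VLM conditioned on the image and prompt.

import torch

def label_log_likelihood(logits: torch.Tensor, label_ids: torch.Tensor) -> float:
    # Sum of log p(token_t | prefix) over the label's tokens.
    # logits: (seq_len, vocab_size), aligned so logits[t] predicts label_ids[t].
    log_probs = torch.log_softmax(logits, dim=-1)
    token_log_probs = log_probs[torch.arange(label_ids.numel()), label_ids]
    return token_log_probs.sum().item()

if __name__ == "__main__":
    torch.manual_seed(0)
    vocab_size = 32000
    candidates = {  # hypothetical tokenizations of three class names
        "golden retriever": torch.tensor([5121, 893]),
        "tabby cat": torch.tensor([771, 2391]),
        "red fox": torch.tensor([1090, 4412]),
    }
    scores = {}
    for name, ids in candidates.items():
        # In practice these logits come from the VLM conditioned on the image
        # and prompt, with the label appended (teacher forcing); random here.
        logits = torch.randn(ids.numel(), vocab_size)
        scores[name] = label_log_likelihood(logits, ids)
    prediction = max(scores, key=scores.get)
    print(prediction)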

We randomly sample 100, 20, 5, or 2 candidate classes from the full label set for each image. The performance gap between VLMs and CLIP models shrinks as the label set gets smaller, but it never fully closes. X-axis: number of candidate classes; Y-axis: accuracy (%).


Training

Since none of the inference factors fully closes the gap between VLMs and CLIP models, we turn to training-related questions. For example, is the visual information from the vision encoder still preserved in the VLM's latent space? Is the text-generation objective as effective as the cross-entropy loss for learning classification? Surprisingly, the results show that the information is preserved and that the text-generation objective is adequate for learning classification.

(Left) We conduct feature-probing experiments on the VLM's last layer and find that the information required for classification is mostly preserved in the VLM's latent space. (Right) We fine-tune VLMs on the classification datasets using the text-generation objective and find that it is as effective as the traditional cross-entropy loss for learning classification, eliminating the VLM-CLIP performance gap.
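
Below is a minimal sketch of the feature-probing setup, with synthetic features standing in for the frozen VLM's last-layer hidden states; in practice the features would be pooled last-layer activations of real images.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def linear_probe_accuracy(features: np.ndarray, labels: np.ndarray) -> float:
    # Freeze the backbone: fit only a linear classifier on top of the features.
    x_tr, x_te, y_tr, y_te = train_test_split(
        features, labels, test_size=0.2, random_state=0
    )
    clf = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
    return clf.score(x_te, y_te)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    num_classes, dim, n = 10, 1024, 2000  # dim stands in for the VLM hidden size
    # In practice, features[i] would be the (pooled) last-layer hidden state of
    # the frozen VLM for image i; here we simulate class-dependent features.
    centers = rng.normal(size=(num_classes, dim))
    labels = rng.integers(0, num_classes, size=n)
    features = centers[labels] + 0.5 * rng.normal(size=(n, dim))
    print(f"linear probe accuracy: {linear_probe_accuracy(features, labels):.3f}")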


Data

Finally, we investigate data-related questions. For example, does the VLM training data include enough classification data and cover enough classes? We find a strong correlation between class exposure during training and model performance. Moreover, VLMs can achieve the same level of performance as CLIP models when trained with enough data. These results suggest that data is the primary cause of VLMs' poor classification performance.

We study the relationship between the frequency of each ImageNet class in the VLM training data and the VLM's classification performance on that class. We observe a strong correlation, indicating that training data determines VLM classification performance.
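
Below is a rough sketch of how such a frequency-accuracy analysis can be set up. The toy corpus, class names, and per-class accuracies are placeholders, and Spearman correlation is one reasonable choice of statistic.

from collections import Counter

import numpy as np
from scipy.stats import spearmanr

def class_frequencies(training_texts, class_names):
    # Count how often each class name appears in the training text.
    counts = Counter({name: 0 for name in class_names})
    for text in training_texts:
        lowered = text.lower()
        for name in class_names:
            counts[name] += lowered.count(name.lower())
    return counts

if __name__ == "__main__":
    classes = ["goldfish", "tabby cat", "fire engine"]
    corpus = [  # placeholder for the VLM training / instruction-tuning text
        "A goldfish swimming in a bowl next to a tabby cat.",
        "Describe the goldfish in this image.",
        "A red fire engine parked outside the station.",
    ]
    per_class_accuracy = {"goldfish": 0.80, "tabby cat": 0.60, "fire engine": 0.55}  # hypothetical
    freq = class_frequencies(corpus, classes)
    x = np.array([freq[c] for c in classes], dtype=float)
    y = np.array([per_class_accuracy[c] for c in classes])
    rho, p_value = spearmanr(x, y)
    print(f"frequencies={dict(freq)}  spearman_rho={rho:.2f} (p={p_value:.2f})")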

We further study whether the data type (e.g., classification or captioning data for a given class) affects VLM performance. We fine-tune the VLM on caption-focused data generated by GPT-4 under the same experimental settings and find that exposure to the class data is the main determining factor for VLM performance; the data type matters little.



🛠️ Improving VLM with Classification Data

We believe that classification is foundational to more advanced capabilities such as visual question answering or reasoning. Based on our analysis, we propose a simple enhancement of VLMs by integrating classification-focused data into their training. We demonstrate that this data intervention not only boosts the VLM's classification performance but also enhances its general capabilities.
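
Below is a minimal sketch of this kind of data intervention: plain (image, class label) pairs are rendered as instruction-style conversations so they can be mixed into VLM instruction tuning. The LLaVA-style JSON layout and the question/answer wording are illustrative assumptions, not necessarily the exact format used in the paper.

import json

def to_instruction_example(image_path: str, class_name: str) -> dict:
    # One classification example rendered as a single-turn conversation.
    return {
        "image": image_path,
        "conversations": [
            {"from": "human", "value": "<image>\nWhat type of object is in this photo?"},
            {"from": "gpt", "value": f"This is a photo of a {class_name}."},
        ],
    }

if __name__ == "__main__":
    samples = [  # hypothetical (image path, label) pairs
        ("images/n01443537_001.jpg", "goldfish"),
        ("images/n03345487_042.jpg", "fire engine"),
    ]
    dataset = [to_instruction_example(path, label) for path, label in samples]
    print(json.dumps(dataset, indent=2))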

ImageWikiQA is a multiple-choice question-answering dataset collected by feeding the Wikipedia pages of ImageNet classes to GPT-4. We find that current VLMs answer these questions poorly, suggesting that their weak classification ability fundamentally limits more advanced capabilities. Integrating classification data into VLM training enhances both their classification and overall capabilities.
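
For reference, below is a minimal sketch of multiple-choice evaluation in this style. The lettered answer format, the answer parsing, and the example question are illustrative assumptions rather than the dataset's actual schema.

import re
import string

def format_question(question: str, choices: list[str]) -> str:
    # Render a multiple-choice question with lettered options.
    lines = [question]
    for letter, choice in zip(string.ascii_uppercase, choices):
        lines.append(f"{letter}. {choice}")
    lines.append("Answer with the letter of the correct option only.")
    return "\n".join(lines)

def parse_answer(model_output: str):
    # Extract the first standalone option letter from the model's response.
    match = re.search(r"\b([A-D])\b", model_output.upper())
    return match.group(1) if match else None

if __name__ == "__main__":
    question = "Which dog breed shown in the image was originally bred to retrieve waterfowl?"
    choices = ["Golden Retriever", "Border Collie", "Shiba Inu", "Dalmatian"]
    print(format_question(question, choices))
    prediction = parse_answer("The answer is A.")  # stand-in for a VLM response
    print("correct" if prediction == "A" else "incorrect")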

BibTeX


@article{VLMClassifier,
  title={Why are Visually-Grounded Language Models Bad at Image Classification?},
  author={Zhang, Yuhui and Unell, Alyssa and Wang, Xiaohan and Ghosh, Dhruba and Su, Yuchang and Schmidt, Ludwig and Yeung-Levy, Serena},
  journal={arXiv preprint arXiv:2405.18415},
  year={2024}
}