Closing the Modality Gap for Mixed Modality Search

Stanford University

Abstract

Mixed modality search—retrieving information across a heterogeneous corpus composed of images, texts, and multimodal documents—is an important yet underexplored real-world application. In this work, we investigate how contrastive vision-language models, such as CLIP, perform on the mixed modality search task. Our analysis reveals a critical limitation: these models exhibit a pronounced modality gap in the embedding space, where image and text embeddings form distinct clusters, leading to intra-modal ranking bias and inter-modal fusion failure. To address this issue, we propose GR-CLIP, a lightweight post-hoc calibration method that removes the modality gap in CLIP’s embedding space. Evaluated on MixBench—the first benchmark specifically designed for mixed modality search—GR-CLIP improves NDCG@10 by up to 26 percentage points over CLIP and surpasses recent vision-language generative embedding models by 4 percentage points while using 75$\times$ less compute.


💡 Paper Overview

Mixed modality search aims to retrieve semantically relevant content when both the query and the documents may consist of different combinations of modalities, such as $\textcolor{orange}{\text{text}},\ \textcolor{magenta}{\text{image}},\ \textcolor{SkyBlue}{\text{screenshot}},\ \textcolor{purple}{\text{audio}},\ \textcolor{gray}{\text{video}}$. It differs from traditional retrieval in two key ways: first, the corpus contains a heterogeneous mix of modality types across documents; second, some documents combine multiple modalities that must be jointly interpreted. The goal is to rank documents based on semantic meaning, regardless of their modality.

Our paper can be summarized into three main points:

  • (a) We formalize mixed modality retrieval as retrieving from a corpus of pure-text, pure-image, and combined text-image documents, and identify two core challenges: cross-modal alignment (Retrieval with a Heterogeneous Corpus) and multimodal fusion (Retrieval with Multimodal Documents).
  • (b-d) We analyze how CLIP-based models fuse multiple modalities and reveal that the "modality gap" causes embeddings from different modalities to cluster into distinct groups, biasing similarity scores. To address this, we propose GR-CLIP (GR stands for gap-removed), a lightweight post-hoc calibration method that subtracts each modality's mean embedding to center representations, yielding significant improvements with minimal extra computation (see the sketch after this list).
  • (e) We introduce MixBench, a benchmark that unifies the two retrieval settings; each of its subsets contains equal proportions of pure-text, pure-image, and multimodal documents to simulate realistic search scenarios.
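The gap-removal step above amounts to subtracting each modality's mean embedding and re-normalizing. Below is a minimal NumPy sketch of that idea, assuming precomputed L2-normalized CLIP embeddings; the array names and the set used to estimate the means are illustrative rather than the paper's exact implementation.

import numpy as np

def remove_modality_gap(text_emb: np.ndarray, image_emb: np.ndarray):
    """Center each modality at a shared origin (GR-CLIP-style calibration sketch).

    text_emb, image_emb: (n, d) L2-normalized embeddings from a CLIP text/image encoder.
    Returns mean-centered, re-normalized embeddings for both modalities.
    """
    # Estimate each modality's mean direction (the source of the modality gap).
    text_mean = text_emb.mean(axis=0, keepdims=True)
    image_mean = image_emb.mean(axis=0, keepdims=True)

    # Subtract the per-modality means so text and image clusters share a common center.
    text_centered = text_emb - text_mean
    image_centered = image_emb - image_mean

    # Re-normalize so cosine similarity remains well-defined after centering.
    text_centered /= np.linalg.norm(text_centered, axis=1, keepdims=True)
    image_centered /= np.linalg.norm(image_centered, axis=1, keepdims=True)
    return text_centered, image_centered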

We will present results and analysis for the first two retrieval settings, followed by the mixed modality search scenario.



🛠️ Retrieval with Heterogeneous Corpus

We begin with an ablated setting of mixed modality search: a heterogeneous corpus composed of unimodal documents (e.g., text-only or image-only). This setting evaluates whether a retrieval model can effectively handle the challenge of cross-modal alignment.

  • (a) Dataset Construction: We construct a heterogeneous corpus by randomly replacing text documents with either screenshot renderings of the text or paired images with probability $p$. Since the semantic content remains unchanged, a retrieval system with perfect cross-modal alignment should maintain the same performance regardless of $p$ (a construction sketch follows this list).
  • (b) Initial Results & Simulation: Surprisingly, CLIP exhibits a U-shaped performance curve as text is replaced with screenshots. We attribute this behavior to the modality gap in CLIP's embedding space. A simulation experiment that artificially penalizes cross-modal documents reproduces the same U-shaped trend, confirming our hypothesis.
  • (c) Method (GR-CLIP): Building on prior work, we propose GR-CLIP, a simple post-hoc calibration that removes the modality gap via mean-centering of text and image embeddings.
  • (d) Improved Results: GR-CLIP flattens the U-shaped curve and significantly improves retrieval accuracy, achieving comparable or better performance than the VLM2Vec baseline with far less compute.
  • (e) Generalization Across Models, Datasets, and Modalities: To evaluate generalization, we test GR-CLIP across three CLIP variants, three additional datasets, and three other modalities (detailed in the Appendix). In all cases, the findings and improvements hold consistently.
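To make the corpus construction in (a) concrete, here is a small sketch of the replacement procedure, assuming an aligned list of visual renderings (screenshot renderings or paired images) of the same documents; the function and argument names are hypothetical placeholders, not the released dataset code.

import random

def build_heterogeneous_corpus(text_docs, visual_docs, p, seed=0):
    """Replace each text document with its visual counterpart with probability p.

    text_docs:   list of text documents.
    visual_docs: aligned list of visual versions of the same documents
                 (screenshot renderings or paired images); hypothetical input.
    """
    rng = random.Random(seed)
    corpus = []
    for text, visual in zip(text_docs, visual_docs):
        # Semantics stay fixed; only the modality of the document changes.
        corpus.append(visual if rng.random() < p else text)
    return corpus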



🛠️ Retrieval with Multimodal Documents

We now consider another ablated setting of mixed modality search: the retrieval corpus is homogeneous, but each document is multimodal, containing both image and text. This setup evaluates the model’s ability to fuse multimodal information, where image and text together should provide richer semantic cues than either modality alone.

(a) Dataset Construction: Each document contains both image and text, and embeddings are obtained by fusing modality-specific features. We vary the fusion coefficient $\alpha$ to evaluate the model's ability to integrate multimodal information. (b) Results: GR-CLIP consistently outperforms CLIP across three model variants and four datasets, demonstrating that the modality gap hinders effective multimodal fusion—and that removing it significantly enhances retrieval performance.
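One plausible form of the fusion step, assuming each multimodal document's embedding is a convex combination of its (optionally gap-removed) image and text embeddings weighted by $\alpha$ and then re-normalized; the exact weighting used in the paper may differ, and the names below are illustrative.

import numpy as np

def fuse_document(image_emb: np.ndarray, text_emb: np.ndarray, alpha: float,
                  image_mean=None, text_mean=None) -> np.ndarray:
    """Fuse one document's image and text embeddings with coefficient alpha.

    If per-modality means are supplied, the embeddings are mean-centered first
    (gap removal), so the fusion is not biased toward either modality cluster.
    """
    if image_mean is not None:
        image_emb = image_emb - image_mean
    if text_mean is not None:
        text_emb = text_emb - text_mean

    fused = alpha * image_emb + (1.0 - alpha) * text_emb
    # Normalize so the fused document embedding is comparable under cosine similarity.
    return fused / np.linalg.norm(fused)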


MixBench: A new benchmark specifically designed for mixed modality search, mirroring real-world search engine challenges

We unify the findings from the above two settings and extend our analysis to the most realistic scenario: mixed modality search, where documents in the corpus may be purely text, purely image, or a combination of both. This setting mirrors real-world search engine challenges, where retrieval systems must operate over heterogeneous and variably multimodal content.

(a) Dataset Construction: We introduce MixBench, a benchmark where the corpus is heterogeneous and includes multimodal documents, reflecting the most realistic setting for search engines. (b) Results: Across four MixBench subsets and five CLIP variants, GR-CLIP delivers substantial improvements over the original CLIP models by eliminating the modality gap, achieving state-of-the-art performance with significantly lower computational cost.
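For completeness, a minimal sketch of how a query could be scored against a MixBench-style corpus in which text, image, and fused multimodal document embeddings are stacked together; this assumes all embeddings are L2-normalized (and gap-removed for GR-CLIP), and the names are illustrative.

import numpy as np

def top_k_documents(query_emb: np.ndarray, doc_embs: np.ndarray, k: int = 10) -> np.ndarray:
    """Rank a mixed corpus by cosine similarity to the query.

    query_emb: (d,) L2-normalized query embedding.
    doc_embs:  (n, d) L2-normalized document embeddings, regardless of modality.
    Returns indices of the top-k documents, e.g. for an NDCG@10 evaluation.
    """
    scores = doc_embs @ query_emb        # cosine similarity for unit-norm vectors
    return np.argsort(-scores)[:k]       # highest-scoring documents first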


BibTeX

@article{MixedModalitySearch,
  title={Closing the Modality Gap for Mixed Modality Search},
  author={Binxu Li and Yuhui Zhang and Xiaohan Wang and Weixin Liang and Ludwig Schmidt and Serena Yeung-Levy},
  journal={arXiv preprint arXiv:2507.19054},
  year={2025}
}