Mixed modality search, retrieving information across a heterogeneous corpus composed of images, texts, and multimodal documents, is an important yet underexplored real-world application. In this work, we investigate how contrastive vision-language models, such as CLIP, perform on the mixed modality search task. Our analysis reveals a critical limitation: these models exhibit a pronounced modality gap in the embedding space, where image and text embeddings form distinct clusters, leading to intra-modal ranking bias and inter-modal fusion failure. To address this issue, we propose GR-CLIP, a lightweight post-hoc calibration method that removes the modality gap in CLIP's embedding space. Evaluated on MixBench, the first benchmark specifically designed for mixed modality search, GR-CLIP improves NDCG@10 by up to 26 percentage points over CLIP and surpasses recent vision-language generative embedding models by 4 percentage points while using 75$\times$ less compute.
Mixed modality search aims to retrieve semantically relevant content when both the query and the documents may consist of different combinations of modalities, such as $\textcolor{orange}{\text{text}},\ \textcolor{magenta}{\text{image}},\ \textcolor{SkyBlue}{\text{screenshot}},\ \textcolor{purple}{\text{audio}},\ \textcolor{gray}{\text{video}}$. It differs from traditional retrieval in two key ways: first, the corpus contains a heterogeneous mix of modality types across documents; second, some documents combine multiple modalities that must be jointly interpreted. The goal is to rank documents based on semantic meaning, regardless of their modality.
Our paper can be summarized in three main points:
We will present results and analysis for the first two retrieval settings, followed by the mixed modality search scenario.
We begin with an ablated setting of mixed modality search: a heterogeneous corpus composed of unimodal documents (e.g., text-only or image-only). This setting evaluates whether a retrieval model can effectively handle the challenge of cross-modal alignment.
(a) Dataset Construction: We construct a heterogeneous corpus by randomly replacing text documents with either screenshot renderings of the text or paired images with probability $p$. Since the semantic content remains unchanged, a retrieval system with perfect cross-modal alignment should maintain the same performance regardless of $p$. (b) Initial Results $\&$ Simulation: Surprisingly, CLIP exhibits a U-shaped performance curve as text is replaced with screenshots. We attribute this behavior to the modality gap in CLIP's embedding space. A simulation experiment that artificially penalizes cross-modal documents reproduces the same U-shaped trend, confirming our hypothesis. (c) Method (GR-CLIP): Building on prior work, we propose GR-CLIP, a simple post-hoc calibration that removes the modality gap via mean-centering of text and image embeddings. (d) Improved Results: GR-CLIP flattens the U-shaped curve and significantly improves retrieval accuracy, achieving comparable or better performance than the VLM2Vec baseline with far less compute. (e) Generalization Across Models, Datasets, and Modalities: To evaluate generalization, we test GR-CLIP across three CLIP variants, three additional datasets, and three other modalities (detailed in the Appendix). In all cases, the findings and improvements hold consistently.
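The calibration at the core of GR-CLIP is inexpensive to apply. Below is a minimal sketch in NumPy, assuming CLIP text and image embeddings are already extracted as $(n, d)$ arrays and that the per-modality means are estimated from the corpus or a held-out set; the function name and interface are illustrative, not the released implementation.

```python
# Minimal sketch of gap removal via per-modality mean-centering (illustrative only).
# Assumes L2-normalized CLIP embeddings stored as NumPy arrays of shape (n, d).
import numpy as np

def remove_modality_gap(text_emb: np.ndarray, image_emb: np.ndarray):
    """Center each modality on its own mean, then re-normalize.

    Subtracting the modality-specific mean collapses the constant offset between
    the text and image clusters, making cosine similarities comparable both
    within and across modalities.
    """
    text_centered = text_emb - text_emb.mean(axis=0, keepdims=True)
    image_centered = image_emb - image_emb.mean(axis=0, keepdims=True)
    text_centered /= np.linalg.norm(text_centered, axis=1, keepdims=True)
    image_centered /= np.linalg.norm(image_centered, axis=1, keepdims=True)
    return text_centered, image_centered
```

In practice the two modality means could also be estimated once on held-out data and reused at query time, so the calibration adds only a vector subtraction and re-normalization per embedding.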
We now consider another ablated setting of mixed modality search: the retrieval corpus is homogeneous, but each document is multimodal, containing both image and text modalities (a). This setup evaluates the model's ability to fuse multimodal information, where image and text together should provide richer semantic cues than either modality alone.
(a) Dataset Construction: Each document contains both image and text, and embeddings are obtained by fusing modality-specific features. We vary the fusion coefficient $\alpha$ to evaluate the model's ability to integrate multimodal information. (b) Results: GR-CLIP consistently outperforms CLIP across three model variants and four datasets, demonstrating that the modality gap hinders effective multimodal fusion, and that removing it significantly enhances retrieval performance.
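As one concrete instantiation, a multimodal document embedding can be a convex combination of the gap-removed image and text features; treating $\alpha$ as the mixing weight below is an assumption consistent with the description above, not necessarily the paper's exact fusion rule.

```python
# Illustrative fusion of one document's image and text embeddings (1-D vectors of length d).
import numpy as np

def fuse_document_embedding(image_emb: np.ndarray, text_emb: np.ndarray, alpha: float) -> np.ndarray:
    """Convex combination of (gap-removed) image and text features, re-normalized to unit length."""
    fused = alpha * image_emb + (1.0 - alpha) * text_emb
    return fused / np.linalg.norm(fused)
```

With the modality gap removed, the two components live in a shared region of the embedding space, so varying $\alpha$ interpolates between text-dominant and image-dominant document representations rather than switching between two disjoint clusters.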
We unify the findings from the above two settings and extend our analysis to the most realistic scenario: mixed modality search, where documents in the corpus may be purely text, purely image, or a combination of both (a). This setting mirrors real-world search engine challenges, where retrieval systems must operate over heterogeneous and variably multimodal content.
(a) Dataset Construction: We introduce MixBench, a benchmark where the corpus is heterogeneous and includes multimodal documents, reflecting the most realistic setting for search engines. (b) Results: Across four MixBench subsets and five CLIP variants, GR-CLIP delivers substantial improvements over the original CLIP models by eliminating the modality gap, achieving state-of-the-art performance with significantly lower computational cost.
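For intuition on the evaluation, the sketch below ranks a mixed corpus by cosine similarity and scores the ranking with binary-relevance NDCG@10; it is an illustrative loop, and MixBench's official protocol may differ in detail.

```python
# Sketch of retrieval over a mixed corpus plus binary-relevance NDCG@10 (illustrative only).
import numpy as np

def search(query_emb: np.ndarray, corpus_emb: np.ndarray, corpus_ids: list) -> list:
    """Rank all documents (text, image, or fused multimodal) by cosine similarity.

    Embeddings are assumed L2-normalized and gap-removed, so a single dot product
    scores every document regardless of its modality.
    """
    scores = corpus_emb @ query_emb
    order = np.argsort(-scores)
    return [corpus_ids[i] for i in order]

def ndcg_at_10(ranked_doc_ids: list, relevant_ids: set) -> float:
    """NDCG@10 with binary relevance labels."""
    gains = [1.0 if d in relevant_ids else 0.0 for d in ranked_doc_ids[:10]]
    dcg = sum(g / np.log2(i + 2) for i, g in enumerate(gains))
    ideal = sum(1.0 / np.log2(i + 2) for i in range(min(len(relevant_ids), 10)))
    return dcg / ideal if ideal > 0 else 0.0
```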
@article{MixedModalitySearch,
title={Closing the Modality Gap for Mixed Modality Search},
author={Binxu Li and Yuhui Zhang and Xiaohan Wang and Weixin Liang and Ludwig Schmidt and Serena Yeung-Levy},
journal={arXiv preprint arXiv:2507.19054},
year={2025}
}