Closing the Modality Gap Aligns Group-Wise Semantics

Abstract

In multimodal learning, CLIP has been recognized as the de facto method for learning a shared latent space across multiple modalities, placing similar representations close to each other and moving them away from dissimilar ones. Although CLIP-based losses effectively align modalities at the semantic level, the resulting latent spaces often remain only partially shared, revealing a structural mismatch known as the modality gap. While the necessity of addressing this phenomenon remains debated, particularly given its limited impact on instance-wise tasks (e.g., retrieval), we prove that its influence is instead strongly pronounced in group-level tasks (e.g., clustering). To support this claim, we introduce a novel method designed to consistently reduce this discrepancy in two-modal settings, with a straightforward extension to the general n-modal case. Through our extensive evaluation, we demonstrate our novel insight: while reducing the gap provides only marginal or inconsistent improvements in traditional instance-wise tasks, it significantly enhances group-wise tasks. These findings may reshape our understanding of the modality gap, highlighting its key role in improving performance on tasks requiring semantic grouping.

TL;DR

Retrieval is rank-based. A cross-modal offset leave relative rankings almost unchanged.

Clustering is geometry-sensitive. Absolute placement and within-class scatter impact group-wise semantics.

Closing the gap aligns groups. The latent space becomes semantically shared rather than ranking-based across modalities.

Core intuition

Retrieval and clustering respond differently to the modality gap.

How and Why?

Retrieval depends on relative ranking, so a global offset can leave top matches mostly intact.

Clustering depends on the absolute geometry of the latent space, so the same offset increases within-class scatter and weakens semantic grouping.

As the gap is reduced, clustering improves strongly while retrieval stays comparatively flat.

Method

The method combines an explicit alignment term for matching pairs with a centroid-level uniformity term that preserves an expressive, well-spread latent geometry. The final loss is added on top of the contrastive objective.

$$\mathcal{L}_{\text{ATP}} = \frac{1}{M-1} \sum_{m \in M, m \neq a} \left( \frac{1}{N} \sum_{i=1}^N \left( || \mathbf{z}_i^m - \mathbf{z}_i^a ||^2_2 \right) \right)$$

$$\mathcal{L}_{\text{CU}} = \log \left( \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1, j\neq i}^{N} \exp \left( -2 || \boldsymbol \mu_i - \boldsymbol \mu_j ||^2_2 \right) \right)$$

$$\mathcal{L}_{\text{CLgap}} = \mathcal{L}_{\text{gap}} + \frac{1}{2} \cdot \left( \mathcal{L}^{(m \to n)} + \mathcal{L}^{(n \to m)} \right) ,\quad \text{where } \mathcal{L}_{\text{gap}} = \mathcal{L}_{\text{ATP}} + \mathcal{L}_{\text{CU}}$$

$ \mathcal{L}_{\text{ATP}} $

Align true pairs

Pulls matching samples across modalities closer in the space.

$ \mathcal{L}_{\text{CU}} $

Spread centroids

Prevents collapse by keeping semantic centroids uniform over the hypersphere.

Combined objective

Close the gap

Preserves pairwise alignment while improving the geometry needed for group-wise semantics.

Latent-space visualizations

On AV-MNIST, we show the contrast between initialization, CLIP-based learning, and the proposed method. The key shift is from modality-separated clusters to class-aligned multimodal neighborhoods.

PCA visualizations. Initialization begins fragmented, CLIP preserves a visible gap, and the proposed objective produces overlapping modality distributions over the shared space.

Results

MSCOCO gap

0.47 → 0.03

Large reduction in modality separation

MSCOCO V-Measure

12.98 → 23.63

Substantial group-wise improvement

MSCOCO Cosine True Pairs

0.34 → 0.77

True pairs become much more aligned

MSR-VTT gap

0.29 → 0.07

Trimodal latent space becomes tighter

MSR-VTT V-Measure

23.3 → 32.1

Clustering clearly benefits

MSR-VTT T2A R@1

10.3 → 11.8

Retrieval shifts are smaller but preserved

Dataset	Metric	CLIP (LT)	Ours	Notes
MSCOCO	Gap ↓	0.47	0.03	Shared latent space is much more coherent.
MSCOCO	V-Measure ↑	12.98	23.63	Group-wise semantics improve strongly.
MSR-VTT	Gap ↓	0.29	0.07	The same effect holds in the trimodal setting.
MSR-VTT	V-Measure ↑	23.3	32.1	Clustering benefits remain clear at larger scale.

Test it in the wild

Retrieval results with text and image query.

With gap: similarity remains entangled with modality identity.

Without gap: semantic neighborhoods become cross-modal rather than modality-specific.

Takeaway: the latent space becomes useful for grouping, browsing, and additional downstream tasks.

BibTeX

@inproceedings{grassucci2026closing,
  title     = {Closing the Modality Gap Aligns Group-Wise Semantics},
  author    = {Grassucci, Eleonora and Cicchetti, Giordano and Frasca, Emanuele and Uncini, Aurelio and Comminiello, Danilo},
  booktitle = {International Conference on Learning Representations},
  year      = {2026}
}