Published at ICLR 2026

Closing the Modality Gap Aligns Group-Wise Semantics

Eleonora Grassucci, Giordano Cicchetti, Emanuele Frasca, Aurelio Uncini, Danilo Comminiello
Department of Information Engineering, Electronics, and Telecommunications · Sapienza University of Rome
Keywords: Multimodal Learning · Representation Alignment · Modality Gap
Aligning Group-Wise Semantics
[Teaser figure] From modality clusters in CLIP to semantic clusters with no gap (gap: 0.86 → 0.06). The space starts modality-clustered and ends semantics-clustered.
Abstract
In multimodal learning, CLIP has been recognized as the de facto method for learning a shared latent space across multiple modalities, placing similar representations close to each other and moving them away from dissimilar ones. Although CLIP-based losses effectively align modalities at the semantic level, the resulting latent spaces often remain only partially shared, revealing a structural mismatch known as the modality gap. While the necessity of addressing this phenomenon remains debated, particularly given its limited impact on instance-wise tasks (e.g., retrieval), we prove that its influence is instead strongly pronounced in group-level tasks (e.g., clustering). To support this claim, we introduce a novel method designed to consistently reduce this discrepancy in two-modal settings, with a straightforward extension to the general n-modal case. Through our extensive evaluation, we demonstrate our novel insight: while reducing the gap provides only marginal or inconsistent improvements in traditional instance-wise tasks, it significantly enhances group-wise tasks. These findings may reshape our understanding of the modality gap, highlighting its key role in improving performance on tasks requiring semantic grouping.
TL;DR
1. Retrieval is rank-based. A cross-modal offset leaves relative rankings almost unchanged.
2. Clustering is geometry-sensitive. Absolute placement and within-class scatter impact group-wise semantics.
3. Closing the gap aligns groups. The latent space becomes semantically shared across modalities rather than merely rank-consistent.
Core intuition

Retrieval and clustering respond differently to the modality gap.

How and Why?

1. Retrieval depends on relative ranking, so a global offset can leave top matches mostly intact.
2. Clustering depends on the absolute geometry of the latent space, so the same offset increases within-class scatter and weakens semantic grouping.
[Figure] Gap vs. retrieval and clustering: as the gap is reduced, clustering improves strongly while retrieval stays comparatively flat.
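This contrast can be reproduced in a toy simulation (all dimensions, noise levels, and the offset magnitude below are invented for illustration; this is not the paper's experimental setup): adding a constant cross-modal offset to one modality barely changes nearest-pair retrieval, yet inflates within-class scatter in the joint space.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Toy shared space: 3 classes, 20 paired samples each (shapes invented).
centers = rng.normal(size=(3, 16))
labels = np.repeat(np.arange(3), 20)
z_a = normalize(centers[labels] + 0.3 * rng.normal(size=(60, 16)))  # modality A
z_b = normalize(z_a + 0.02 * rng.normal(size=(60, 16)))             # paired modality B

# A constant cross-modal offset plays the role of the modality gap.
offset = 0.8 * normalize(rng.normal(size=(1, 16)))
z_b_gap = normalize(z_b + offset)

def recall_at_1(queries, gallery):
    # Retrieval: is the nearest gallery item the true pair?
    sim = queries @ gallery.T
    return float(np.mean(sim.argmax(axis=1) == np.arange(len(queries))))

def within_class_scatter(z, y):
    # Clustering geometry: mean squared distance to the class centroid.
    return float(np.mean([np.sum((z[y == c] - z[y == c].mean(0)) ** 2, axis=1).mean()
                          for c in np.unique(y)]))

y_joint = np.concatenate([labels, labels])
print("R@1 no gap:", recall_at_1(z_a, z_b))
print("R@1 gap:   ", recall_at_1(z_a, z_b_gap))
print("scatter no gap:", within_class_scatter(np.concatenate([z_a, z_b]), y_joint))
print("scatter gap:   ", within_class_scatter(np.concatenate([z_a, z_b_gap]), y_joint))
```

The offset shifts every modality-B embedding the same way, so rankings against modality A are largely preserved, while each class's joint point cloud splits into two modality-specific halves and its scatter grows.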
Method

The method combines an explicit alignment term for matching pairs with a centroid-level uniformity term that preserves an expressive, well-spread latent geometry. The final loss is added on top of the contrastive objective.

$$\mathcal{L}_{\text{ATP}} = \frac{1}{M-1} \sum_{m \in M, m \neq a} \left( \frac{1}{N} \sum_{i=1}^N \left( || \mathbf{z}_i^m - \mathbf{z}_i^a ||^2_2 \right) \right)$$
$$\mathcal{L}_{\text{CU}} = \log \left( \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1, j\neq i}^{N} \exp \left( -2 || \boldsymbol \mu_i - \boldsymbol \mu_j ||^2_2 \right) \right)$$
$$\mathcal{L}_{\text{CLgap}} = \mathcal{L}_{\text{gap}} + \frac{1}{2} \cdot \left( \mathcal{L}^{(m \to n)} + \mathcal{L}^{(n \to m)} \right) ,\quad \text{where } \mathcal{L}_{\text{gap}} = \mathcal{L}_{\text{ATP}} + \mathcal{L}_{\text{CU}}$$
\( \mathcal{L}_{\text{ATP}} \)
Align true pairs
Pulls matching samples across modalities closer in the space.
\( \mathcal{L}_{\text{CU}} \)
Spread centroids
Prevents collapse by keeping semantic centroids uniformly spread over the hypersphere.
Combined objective
Close the gap
Preserves pairwise alignment while improving the geometry needed for group-wise semantics.
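The two terms can be sketched in plain NumPy (forward pass only, gradients omitted; choosing modality 0 as the anchor a and treating μ_i as the re-normalized cross-modal centroid of pair i are our assumptions, not details stated on this page):

```python
import numpy as np

def atp_loss(z):
    # L_ATP: squared distance between each non-anchor modality's embedding
    # and the anchor's, averaged over the (M - 1) modalities and N pairs.
    # z has shape (M, N, D); the anchor is modality 0.
    return float(np.mean(np.sum((z[1:] - z[0]) ** 2, axis=-1)))

def cu_loss(mu):
    # L_CU: log of the (1/N)-scaled sum of pairwise Gaussian potentials
    # between centroids, pushing them apart over the hypersphere.
    n = mu.shape[0]
    d2 = np.sum((mu[:, None, :] - mu[None, :, :]) ** 2, axis=-1)
    off_diag = ~np.eye(n, dtype=bool)
    return float(np.log(np.sum(np.exp(-2.0 * d2[off_diag])) / n))

# Illustration on random unit embeddings: 2 modalities, 8 pairs, 4 dims.
rng = np.random.default_rng(0)
z = rng.normal(size=(2, 8, 4))
z /= np.linalg.norm(z, axis=-1, keepdims=True)
mu = z.mean(axis=0)
mu /= np.linalg.norm(mu, axis=-1, keepdims=True)  # assumed: pair centroids
gap_loss = atp_loss(z) + cu_loss(mu)              # L_gap = L_ATP + L_CU
```

In training, L_gap would be added to the symmetric contrastive terms as in the combined objective; the sketch only evaluates the extra terms on a batch.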
Latent-space visualizations
Semantic results on the AV-MNIST dataset.
On AV-MNIST, we show the contrast between initialization, CLIP-based learning, and the proposed method. The key shift is from modality-separated clusters to class-aligned multimodal neighborhoods.
Semantic results for two and three modalities.
PCA visualizations. Initialization begins fragmented, CLIP preserves a visible gap, and the proposed objective produces overlapping modality distributions over the shared space.
Results
MSCOCO gap
0.47 → 0.03
Large reduction in modality separation
MSCOCO V-Measure
12.98 → 23.63
Substantial group-wise improvement
MSCOCO Cosine True Pairs
0.34 → 0.77
True pairs become much more aligned
MSR-VTT gap
0.29 → 0.07
Trimodal latent space becomes tighter
MSR-VTT V-Measure
23.3 → 32.1
Clustering clearly benefits
MSR-VTT T2A R@1
10.3 → 11.8
Retrieval gains are smaller but consistent
| Dataset | Metric      | CLIP (LT) | Ours  | Notes                                          |
|---------|-------------|-----------|-------|------------------------------------------------|
| MSCOCO  | Gap ↓       | 0.47      | 0.03  | Shared latent space is much more coherent.     |
| MSCOCO  | V-Measure ↑ | 12.98     | 23.63 | Group-wise semantics improve strongly.         |
| MSR-VTT | Gap ↓       | 0.29      | 0.07  | The same effect holds in the trimodal setting. |
| MSR-VTT | V-Measure ↑ | 23.3      | 32.1  | Clustering benefits remain clear at larger scale. |
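For reference, the gap numbers above can be measured with a common definition of the modality gap, the Euclidean distance between the centroids of the two modalities' unit-normalized embeddings (whether the paper uses exactly this formulation is our assumption). A minimal sketch on toy data:

```python
import numpy as np

def modality_gap(z_a, z_b):
    # One common gap measure: distance between modality centroids
    # of unit-normalized embeddings.
    return float(np.linalg.norm(z_a.mean(axis=0) - z_b.mean(axis=0)))

# Toy illustration: a shared-direction offset creates a nonzero gap,
# which vanishes when the two modalities occupy the same region.
rng = np.random.default_rng(0)
z_a = rng.normal(size=(100, 32))
z_a /= np.linalg.norm(z_a, axis=1, keepdims=True)
offset = np.ones(32) / np.sqrt(32)
z_b = z_a + 0.5 * offset
z_b /= np.linalg.norm(z_b, axis=1, keepdims=True)
print("gap with offset:  ", modality_gap(z_a, z_b))
print("gap when aligned: ", modality_gap(z_a, z_a))
```

Group-wise quality such as V-Measure would then be computed by clustering the joint embeddings and scoring the assignments against class labels, e.g. with scikit-learn's `v_measure_score`.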
Test it in the wild
Retrieval results with text and image query.

A
With gap: similarity remains entangled with modality identity.
B
Without gap: semantic neighborhoods become cross-modal rather than modality-specific.
C
Takeaway: the latent space becomes useful for grouping, browsing, and additional downstream tasks.
BibTeX
@inproceedings{grassucci2026closing,
  title     = {Closing the Modality Gap Aligns Group-Wise Semantics},
  author    = {Grassucci, Eleonora and Cicchetti, Giordano and Frasca, Emanuele and Uncini, Aurelio and Comminiello, Danilo},
  booktitle = {International Conference on Learning Representations},
  year      = {2026}
}