GRAM: Gramian Multimodal Representation Learning and Alignment

Abstract

While recent multimodal models have achieved significant progress by aligning pairs of modalities via contrastive learning, their solutions are unsuitable when scaling to multiple modalities. These models typically align each modality to a designated anchor without ensuring the alignment of all modalities with each other, leading to suboptimal performance in tasks requiring a joint understanding of multiple modalities.

In this paper, we structurally rethink the pairwise conventional approach to multimodal learning and we present the novel Gramian Representation Alignment Measure (GRAM), which overcomes the above-mentioned limitations. GRAM learns and then aligns modalities directly in the higher-dimensional space in which modality embeddings lie by minimizing the Gramian volume of the k-dimensional parallelotope spanned by the modality vectors, ensuring the geometric alignment of all modalities simultaneously.

GRAM can replace cosine similarity in any downstream method, holding for 2 to n modalities and providing more meaningful alignment with respect to previous similarity measures. Moreover, the novel GRAM-based contrastive loss function enhances the alignment of multimodal models in the higher-dimensional embedding space, leading to new state-of-the-art performance in downstream tasks such as video-audio-text retrieval and audio-video classification.

How to compute the volume of a k-dimensional parallelotope?

The idea is simple. The volume of a k-dimensional parallelotope that has one vertex at the origin and whose edges are k vectors, each of dimension n, is the square root of the determinant of the Gram matrix built from those vectors.

In our context, the vectors are the latent representations extracted by the encoders, one per modality. To compute the volume of the k-dimensional parallelotope whose edges are these modality embeddings, we arrange the embeddings as the columns of a matrix A = [v₁, v₂, v₃, ..., v_k], compute the Gram matrix G = A^T A, and take its determinant. The volume is the square root of that determinant.
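The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `gram_volume` and the optional `normalize` flag are assumptions introduced here, and the clamp to zero only guards against tiny negative determinants caused by floating-point round-off.

```python
import numpy as np

def gram_volume(vectors, normalize=False):
    """Volume of the parallelotope whose edges are k vectors of dimension n.

    `vectors` is a sequence of k array-likes of length n (hypothetical
    interface). `normalize=True` rescales each vector to unit norm first,
    which is an assumption of this sketch.
    """
    # Arrange the vectors as the columns of A, so A has shape (n, k).
    A = np.asarray(vectors, dtype=np.float64).T
    if normalize:
        A = A / np.linalg.norm(A, axis=0, keepdims=True)
    # Gram matrix G = A^T A, shape (k, k); G[i, j] is the dot product <v_i, v_j>.
    G = A.T @ A
    # Volume = sqrt(det(G)); clamp at 0 to absorb round-off in degenerate cases.
    return float(np.sqrt(max(np.linalg.det(G), 0.0)))

# Two perpendicular edges of length 3 and 4 in 3-D space: a 3-by-4 rectangle.
rectangle = [[3.0, 0.0, 0.0], [0.0, 4.0, 0.0]]
area = gram_volume(rectangle)

# Two collinear vectors span a degenerate parallelotope with no volume.
collinear = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]
flat = gram_volume(collinear)
```

Here `area` is the area of the rectangle (12 up to floating-point error) and `flat` is zero, which matches the geometric picture: the more the vectors point in the same direction, the smaller the volume.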

How to perform retrieval?

Multimodal text-to-video retrieval is the task of assigning the right textual caption to each video in a set, where a video is intended as a set of frames plus audio plus any other relevant information. This task directly shows how well-shaped the learned latent space is, i.e., whether all the modalities are aligned in it.

To solve this task with GRAM, we fix a textual description and, for each video, compute the volume of the k-dimensional parallelotope formed by the latent representation of the text and the latent representations of that video's modalities. We assign the text description to the video that yields the parallelotope with the smallest volume.

The key concept is the following: the larger the volume, the more semantically dissimilar the modalities are from each other; the smaller the volume, the more semantically similar they are.
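The retrieval rule just described can be sketched as an argmin over per-video volumes. This is a toy illustration under stated assumptions: the names `gram_volume` and `retrieve`, the unit-norm rescaling of every embedding, and the hand-written 4-dimensional vectors are all hypothetical and stand in for real encoder outputs.

```python
import numpy as np

def gram_volume(vectors):
    """sqrt(det(A^T A)) for unit-normalized vectors (normalization is an assumption)."""
    A = np.asarray(vectors, dtype=np.float64).T           # shape (n, k), one column per modality
    A = A / np.linalg.norm(A, axis=0, keepdims=True)      # rescale each column to unit norm
    G = A.T @ A                                           # (k, k) Gram matrix
    return float(np.sqrt(max(np.linalg.det(G), 0.0)))     # clamp round-off below zero

def retrieve(text_emb, candidates):
    """Pick the candidate video whose modalities span the smallest volume with the text.

    `candidates` is a list of videos; each video is a list of modality
    embeddings (for example visual and audio). Returns (best_index, volumes).
    """
    volumes = [gram_volume([text_emb, *modalities]) for modalities in candidates]
    return int(np.argmin(volumes)), volumes

# Made-up embeddings: video 0 roughly agrees with the text in both modalities,
# video 1 is orthogonal to the text in both modalities.
text = [1.0, 0.0, 0.0, 0.0]
videos = [
    [[0.9, 0.1, 0.0, 0.0], [0.8, 0.0, 0.2, 0.0]],  # visual, audio of video 0
    [[0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]],  # visual, audio of video 1
]
best, volumes = retrieve(text, videos)
```

With these toy vectors, `best` is 0: the text, visual, and audio embeddings of video 0 nearly coincide, so their parallelotope is almost flat, while video 1 forms a unit cube with the text.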
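For intuition, the two-modality case can be worked out by hand. The derivation below assumes unit-norm embeddings v₁ and v₂ separated by an angle θ; both the unit-norm assumption and the symbol θ are introduced here for illustration.

```latex
G = \begin{pmatrix} 1 & \cos\theta \\ \cos\theta & 1 \end{pmatrix},
\qquad
\mathrm{Vol} = \sqrt{\det G} = \sqrt{1 - \cos^{2}\theta} = |\sin\theta|
```

Under these assumptions the volume is 0 when the two embeddings are collinear and 1 when they are orthogonal, so for a single pair of modalities the volume is a function of the cosine similarity cos θ.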

Comparison with a cosine similarity based approach

The results from the previous interactive experiment are summarized in the following confusion matrix. As shown, frameworks that rely solely on two modalities (text-visual or text-audio in this example) often fail to assign the correct label to the appropriate video. GRAM, on the other hand, effectively leverages multiple modalities to produce a more accurate retrieval result by integrating information from all available modalities.
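The failure mode described above can be reproduced on a toy example. The sketch below is purely illustrative: the 3-dimensional vectors are made up (they are not results from the paper), video 0 is assumed to be the correct match, and the helper names `unit`, `cosine`, and `gram_volume` are hypothetical.

```python
import numpy as np

def unit(v):
    """Rescale a vector to unit norm (assumption: embeddings are compared on the unit sphere)."""
    v = np.asarray(v, dtype=np.float64)
    return v / np.linalg.norm(v)

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(unit(u) @ unit(v))

def gram_volume(vectors):
    """sqrt(det(A^T A)) with the unit-normalized vectors as the columns of A."""
    A = np.stack([unit(v) for v in vectors], axis=1)   # shape (n, k)
    G = A.T @ A                                        # (k, k) Gram matrix
    return float(np.sqrt(max(np.linalg.det(G), 0.0)))  # clamp round-off below zero

text = [1.0, 0.0, 0.0]
videos = [
    # Video 0 (assumed correct): both visual and audio are close to the text.
    {"visual": [0.80, 0.60, 0.0], "audio": [0.90, 0.0, 0.44]},
    # Video 1: visual is slightly closer to the text, but audio is unrelated.
    {"visual": [0.85, 0.53, 0.0], "audio": [0.0, 0.0, 1.0]},
]

# Text-visual cosine similarity only looks at one pair of modalities.
cos_scores = [cosine(text, v["visual"]) for v in videos]
cosine_choice = int(np.argmax(cos_scores))

# The Gramian volume looks at text, visual, and audio together.
volumes = [gram_volume([text, v["visual"], v["audio"]]) for v in videos]
gram_choice = int(np.argmin(volumes))
```

In this constructed case the text-visual cosine similarity prefers video 1, because it never sees the mismatched audio, while the volume over all three modalities prefers video 0.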

Experimental evidence

In this section, we present the main results of the proposed GRAM contrastive loss and model on downstream tasks. We test our approach on different datasets and tasks. GRAM sets new state-of-the-art results on almost all the datasets, highlighting the alignment power of the proposed method.

BibTeX

@article{cicchetti2024gram,
  author    = {Cicchetti, Giordano and Grassucci, Eleonora and Sigillo, Luigi and Comminiello, Danilo},
  title     = {Gramian Multimodal Representation Learning and Alignment},
  journal   = {arXiv},
  year      = {2024},
}
