Stable-V2A

Synthesis of Synchronized Sound Effects with Temporal and Semantic Controls

Riccardo F. Gramaccioni, Christian Marinoni, Emilian Postolache, Marco Comunità, Luca Cosmo, Joshua D. Reiss, and Danilo Comminiello

 Sapienza University of Rome, Italy
 Queen Mary University of London, UK
 Ca' Foscari University of Venice, Italy

What's new?

  • A two-stage model consisting of: an RMS-Mapper that estimates the envelope of the target audio given an input video; and Stable-Foley, a diffusion model that generates audio semantically and temporally aligned with the target video.
  • Temporal alignment is guaranteed by using the envelope as an input to the ControlNet, while semantic alignment is achieved through cross-attention conditioning of the diffusion process.
  • We test our model on Greatest Hits and we introduce a brand-new dataset named Walking The Maps.

Introduction

Sound designers and Foley artists usually sonorize a scene, such as one from a movie or video game, by manually annotating each action of interest in the video and producing the corresponding sound.
In our case, the intent is to leave full creative control to sound designers with a tool that allows them to bypass the more repetitive parts of their work, thus being able to focus on the creative aspects of sound production.

We introduce a new two-stage model for Video-to-Audio (V2A) generation, Stable-V2A, where the two stages are represented by RMS-Mapper and Stable-Foley.

RMS-Mapper is a simple but effective network that estimates an RMS envelope directly from a video; it takes representative features of the video frames and optical flow as input.
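For reference, the RMS envelope estimated by RMS-Mapper is simply the frame-wise root-mean-square of a waveform. Below is a minimal sketch of how such an envelope can be computed from audio, e.g. as a training target; the frame and hop sizes are illustrative assumptions, not the values used in the paper.

```python
import numpy as np

def rms_envelope(waveform: np.ndarray, frame_length: int = 1024, hop_length: int = 512) -> np.ndarray:
    """Frame-wise root-mean-square envelope of a mono waveform."""
    n_frames = 1 + max(0, len(waveform) - frame_length) // hop_length
    env = np.empty(n_frames)
    for i in range(n_frames):
        frame = waveform[i * hop_length : i * hop_length + frame_length]
        env[i] = np.sqrt(np.mean(frame ** 2))
    return env

# Toy usage with a random signal standing in for a real audio track.
audio = np.random.randn(44100)   # 1 second at 44.1 kHz
envelope = rms_envelope(audio)
print(envelope.shape)            # (85,) frames with the settings above
```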

Stable-Foley, the model for synthesizing sound effects, relies on Stable Audio, a state-of-the-art latent diffusion model for audio generation, to obtain the final output. In order to take advantage of the prior knowledge of this model and guide its generation process via the RMS envelope, we leverage a ControlNet, which is a network used to control the diffusion process via extra conditioning. 

We train and test our model on Greatest Hits, a dataset commonly used to evaluate V2A models. In addition, to test our model on a case study of interest, we introduce in this paper Walking The Maps, a dataset of videos extracted from video games depicting animated characters walking in different locations.

How it works

Stable-V2A generates audio that is semantically and temporally aligned with the input video.

It all starts from a silent video, which represents the input to our model.

In Stage 1, RMS-Mapper generates an envelope from the frame and optical flow embeddings passed as input (extracted via TC-CLIP).
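Purely as an illustration of the idea (the actual RMS-Mapper architecture is described in the paper), the toy module below maps per-frame video and optical flow embeddings to an envelope with a small sequence model; all layer choices and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class ToyRMSMapper(nn.Module):
    """Illustrative stand-in for RMS-Mapper: video embeddings -> RMS envelope."""

    def __init__(self, frame_dim=512, flow_dim=512, hidden=256):
        super().__init__()
        self.proj = nn.Linear(frame_dim + flow_dim, hidden)
        self.temporal = nn.GRU(hidden, hidden, batch_first=True)  # any sequence model could be used here
        self.head = nn.Linear(hidden, 1)

    def forward(self, frame_emb, flow_emb):
        # frame_emb, flow_emb: (batch, time, dim) features extracted from RGB frames / optical flow
        x = torch.cat([frame_emb, flow_emb], dim=-1)
        x, _ = self.temporal(torch.relu(self.proj(x)))
        return torch.sigmoid(self.head(x)).squeeze(-1)  # (batch, time) envelope, normalized to [0, 1]

# Dummy forward pass: 8 video frames with 512-d frame and flow embeddings.
mapper = ToyRMSMapper()
env = mapper(torch.randn(1, 8, 512), torch.randn(1, 8, 512))
print(env.shape)  # torch.Size([1, 8])
```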

In Stage 2, Stable-Foley generates the output audio. Here, semantics are specified by conditioning the latent diffusion model, via cross-attention, on a latent representation of an audio sample or a text prompt. Moreover, to control the process with the RMS envelope estimated in Stage 1, we use a ControlNet.
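The snippet below is a conceptual sketch only, not the actual Stable Audio / ControlNet implementation: it illustrates how a denoiser can combine cross-attention on a semantic embedding with additive residuals produced by a ControlNet-style branch fed with the envelope (the diffusion timestep and many other details are omitted).

```python
import torch
import torch.nn as nn

class ToyControlledDenoiser(nn.Module):
    """Conceptual sketch: each block attends to a semantic embedding (cross-attention)
    and receives an additive residual derived from the RMS envelope (ControlNet-style)."""

    def __init__(self, dim=64, n_blocks=3):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.TransformerDecoderLayer(dim, nhead=4, batch_first=True) for _ in range(n_blocks)]
        )
        # ControlNet-style branch: maps the envelope to one residual per block.
        self.control = nn.ModuleList([nn.Linear(1, dim) for _ in range(n_blocks)])

    def forward(self, noisy_latent, semantic_emb, envelope):
        # noisy_latent: (B, T, dim)  latent audio being denoised
        # semantic_emb: (B, S, dim)  CLAP-style embedding used as cross-attention memory
        # envelope:     (B, T, 1)    temporal control signal from Stage 1
        x = noisy_latent
        for block, ctrl in zip(self.blocks, self.control):
            x = block(x, memory=semantic_emb)  # semantic alignment via cross-attention
            x = x + ctrl(envelope)             # temporal alignment via the control residual
        return x

model = ToyControlledDenoiser()
out = model(torch.randn(2, 16, 64), torch.randn(2, 4, 64), torch.rand(2, 16, 1))
print(out.shape)  # torch.Size([2, 16, 64])
```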

In the case of the silent video above, we generated 2-second chunks and concatenated them to obtain the full track.
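A minimal sketch of this chunk-and-concatenate step is shown below; generate_chunk is a hypothetical stand-in for a call to Stable-Foley, and the sample rate and chunk length in envelope frames are assumptions.

```python
import numpy as np

SR = 44100          # assumed sample rate
CHUNK_SECONDS = 2   # length of each generated chunk

def generate_chunk(envelope_chunk, semantic_emb):
    """Hypothetical stand-in for Stable-Foley: returns 2 s of audio for one chunk."""
    return np.random.randn(CHUNK_SECONDS * SR)  # placeholder audio

def generate_full_track(envelope, semantic_emb, chunk_frames):
    chunks = []
    for start in range(0, len(envelope), chunk_frames):
        env_chunk = envelope[start : start + chunk_frames]
        chunks.append(generate_chunk(env_chunk, semantic_emb))
    return np.concatenate(chunks)  # full-length track

track = generate_full_track(np.random.rand(800), semantic_emb=None, chunk_frames=100)
print(track.shape)  # (705600,) = 8 chunks x 2 s x 44100 Hz
```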

Examples

Here we present some examples to showcase the results obtained by our method and the baselines.

You can listen to selected samples generated from audio and text prompts in the sections below.

Greatest Hits

We use the Greatest Hits dataset, a widely adopted dataset for V2A tasks.
This dataset includes videos of humans using a drumstick to hit or rub objects or surfaces.


We also use CLAP for semantic conditioning, meaning that we can use text as an additional modality at inference time.
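As an illustration, text embeddings can be obtained with the open-source laion_clap package; the snippet below is a sketch that assumes the default pretrained checkpoint, which may differ from the CLAP model used in the paper, and the prompts are purely illustrative.

```python
import laion_clap

# Load a pretrained CLAP model (downloads the default checkpoint).
clap = laion_clap.CLAP_Module(enable_fusion=False)
clap.load_ckpt()

# Text prompts used as semantic conditions at inference time.
prompts = ["hitting a metal fence with a wooden drumstick",
           "footsteps on a wooden floor"]
text_emb = clap.get_text_embedding(prompts)
print(text_emb.shape)  # (2, 512) embeddings fed to the cross-attention conditioning
```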


Walking The Maps

In this work, we introduce a new dataset named Walking The Maps (WTM).

WTM includes short videos extracted from YouTube gameplay sessions of 4 video games: Hogwarts Legacy, Zelda Breath of the Wild, Assassin's Creed: Odyssey, and Assassin's Creed IV: Black Flag.

For each video, we extracted only clips in which the sound of the footsteps is clearly audible and no other sound sources in the video could be related to the target sound.
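As a sketch of how such clips can be cut from the downloaded videos, the snippet below uses ffmpeg through subprocess; the filenames and timestamps are purely illustrative and not part of the released dataset.

```python
import subprocess

def extract_clip(src: str, start: str, duration: float, dst: str) -> None:
    """Cut a clip of the given duration (seconds) starting at `start` using stream copy."""
    subprocess.run(
        ["ffmpeg", "-ss", start, "-t", str(duration), "-i", src, "-c", "copy", dst],
        check=True,
    )

# Illustrative usage: a 6-second clip starting at 02:15 of a hypothetical source file.
extract_clip("gameplay_session.mp4", "00:02:15", 6, "wtm_clip_0001.mp4")
```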


Results

Results for Stable-V2A and a comparison with other state-of-the-art models on Greatest Hits. The table shows whether each model generates the output conditioned on an audio or text prompt; HRC stands for Human-Readable Control and refers to the use of time-varying interpretable signals that sound designers can use to control the generation process (i.e., envelope or onsets).

Our model provides the best results, even in the setting of text-conditioned generation.

| Model | Audio | Text | HRC | FAD-P ↓ | FAD-C ↓ | FAD-LC ↓ | E-L1 ↓ | CLAP ↑ | FAVD ↓ |
|---|---|---|---|---|---|---|---|---|---|
| SpecVQGAN | | | | 99.07 | 1001 | 0.7102 | 0.0427 | 0.1418 | 6.5136 |
| Diff-Foley | | | | 85.70 | 654 | 0.4690 | 0.0448 | 0.3733 | 4.6186 |
| CondFoleyGen | | | | 74.93 | 650 | 0.4883 | 0.0357 | 0.4879 | 6.4814 |
| SyncFusion | | | | 35.64 | 591 | 0.4365 | 0.0231 | 0.5154 | 4.3020 |
| | | | | 27.85 | 542 | 0.2793 | 0.0177 | 0.6621 | 3.2825 |
| Video-Foley | | | | 67.04 | 644 | 0.4997 | 0.0242 | 0.3680 | 4.9106 |
| | | | | 28.45 | 435 | 0.1671 | 0.0183 | 0.6779 | 2.2070 |
| Stable-V2A (Ours) | | | | 32.80 | 381 | 0.2516 | 0.0137 | 0.4806 | 3.9413 |
| | | | | 16.57 | 217 | 0.1048 | 0.0137 | 0.6833 | 2.0264 |
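As an example of a human-readable temporal metric, one plausible implementation of the envelope L1 distance (E-L1) compares the RMS envelopes of generated and reference audio; the frame settings below are assumptions and may differ from those used in the paper.

```python
import numpy as np

def rms_envelope(x, frame=1024, hop=512):
    n = 1 + max(0, len(x) - frame) // hop
    return np.array([np.sqrt(np.mean(x[i * hop : i * hop + frame] ** 2)) for i in range(n)])

def envelope_l1(generated, reference, frame=1024, hop=512):
    """Mean absolute difference between the two RMS envelopes (lower is better)."""
    a = rms_envelope(generated, frame, hop)
    b = rms_envelope(reference, frame, hop)
    n = min(len(a), len(b))  # guard against off-by-one frame counts
    return float(np.mean(np.abs(a[:n] - b[:n])))

# Toy usage with 2 seconds of random audio standing in for real signals.
score = envelope_l1(np.random.randn(88200), np.random.randn(88200))
print(round(score, 4))
```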

Cite us

If you found this work useful, please cite us as follows:

@article{gramaccioni2024stable,
 title={Stable-V2A: Synthesis of Synchronized Sound Effects with Temporal and Semantic Controls},
 author={Riccardo Fosco Gramaccioni and Christian Marinoni and Emilian Postolache and Marco Comunità and Luca Cosmo and Joshua D. Reiss and Danilo Comminiello},
 journal={arXiv preprint arXiv:2412.15023},
 year={2024}
}