Riccardo F. Gramaccioni, Christian Marinoni, Emilian Postolache, Marco Comunità, Luca Cosmo, Joshua D. Reiss, and Danilo Comminiello
Sapienza University of Rome, Italy
Queen Mary University of London, UK
Ca' Foscari University of Venice, Italy
Sound designers and Foley artists usually sonorize a scene, such as one from a movie or video game, by manually annotating each action of interest in the video and producing a matching sound for it.
Our intent is to leave full creative control to sound designers, giving them a tool that bypasses the more repetitive parts of their work so they can focus on the creative aspects of sound production.
We introduce a new two-stage model for Video-to-Audio (V2A) generation, Stable-V2A, where the two stages are represented by RMS-Mapper and Stable-Foley.
RMS-Mapper is a simple but effective network that estimates an RMS envelope directly from a video, taking representative frame and optical-flow features as input.
Stable-Foley, the model for synthesizing sound effects, relies on Stable Audio, a state-of-the-art latent diffusion model for audio generation, to obtain the final output. In order to take advantage of the prior knowledge of this model and guide its generation process via the RMS envelope, we leverage a ControlNet, which is a network used to control the diffusion process via extra conditioning.
We train and test our model on Greatest Hits, a dataset commonly used to evaluate V2A models. In addition, to test our model on a case study of interest, we introduce Walking The Maps, a dataset of videos extracted from video games depicting animated characters walking in different locations.
Stable-V2A generates audio that is semantically and temporally aligned with the input videos.
It all starts from a silent video, which represents the input to our model.
In Stage 1, RMS-Mapper generates an envelope from the frame and optical-flow embeddings passed as input (extracted via TC-CLIP).
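For reference, the RMS envelope predicted in Stage 1 is a frame-wise measure of signal energy. Below is a minimal sketch of how such an envelope can be computed from a waveform; the window and hop sizes are illustrative assumptions, not the exact values used in the paper.

```python
import numpy as np

def rms_envelope(audio: np.ndarray, frame_len: int = 1024, hop: int = 256) -> np.ndarray:
    """Frame-wise RMS of a mono waveform: the square root of the mean squared
    amplitude inside each analysis window."""
    n_frames = 1 + max(len(audio) - frame_len, 0) // hop
    env = np.empty(n_frames)
    for i in range(n_frames):
        frame = audio[i * hop : i * hop + frame_len]
        env[i] = np.sqrt(np.mean(frame ** 2))
    return env
```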
In Stage 2, Stable-Foley generates the output audio. Here, the semantics are specified by cross-conditioning the latent diffusion model with a latent representation of an audio sample or a text prompt. Moreover, to control the process with the RMS envelope estimated in Stage 1, we use a ControlNet.
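The ControlNet idea can be pictured with a small, self-contained PyTorch sketch. This is illustrative only, not the paper's implementation: it shows the general recipe of a trainable copy of a frozen backbone block that consumes the control signal (here, the RMS envelope projected to the latent shape) and feeds its output back through a zero-initialized projection, so the backbone behaves exactly as before at the start of training.

```python
import copy

import torch
import torch.nn as nn


class ToyControlBlock(nn.Module):
    """Toy ControlNet-style conditioning: frozen backbone block + trainable copy
    that sees the control signal, merged through a zero-initialized projection."""

    def __init__(self, backbone_block: nn.Module, channels: int):
        super().__init__()
        self.frozen = backbone_block
        for p in self.frozen.parameters():
            p.requires_grad = False          # keep the pretrained backbone frozen
        self.trainable_copy = copy.deepcopy(backbone_block)
        for p in self.trainable_copy.parameters():
            p.requires_grad = True           # only the control branch is trained
        self.zero_proj = nn.Conv1d(channels, channels, kernel_size=1)
        nn.init.zeros_(self.zero_proj.weight)
        nn.init.zeros_(self.zero_proj.bias)

    def forward(self, x: torch.Tensor, control: torch.Tensor) -> torch.Tensor:
        # x and control share the shape (batch, channels, time).
        return self.frozen(x) + self.zero_proj(self.trainable_copy(x + control))


# Toy usage: a 1-D conv block stands in for a block of the audio backbone, and a
# random tensor stands in for the RMS envelope projected to the latent shape.
block = nn.Conv1d(64, 64, kernel_size=3, padding=1)
layer = ToyControlBlock(block, channels=64)
x, control = torch.randn(2, 64, 128), torch.randn(2, 64, 128)
y = layer(x, control)   # equals block(x) at initialization, since zero_proj outputs zeros
```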
For the silent video above, we generated 2-second chunks and concatenated them to obtain the full track; a rough sketch of this chunking step is shown below, followed by the resulting audio:
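In this sketch, `generate_chunk` is a hypothetical stand-in for the Stage 1 + Stage 2 pipeline and the sample rate is an assumption.

```python
import numpy as np

SAMPLE_RATE = 44100   # assumed output sample rate
CHUNK_SECONDS = 2     # chunk length used for the demo video above

def render_full_track(video_chunks, generate_chunk):
    """Run the V2A pipeline on each video chunk and concatenate the audio results."""
    return np.concatenate([generate_chunk(chunk) for chunk in video_chunks])

# Toy usage with a silent stand-in generator:
track = render_full_track(range(5), lambda _: np.zeros(CHUNK_SECONDS * SAMPLE_RATE))
assert track.shape[0] == 5 * CHUNK_SECONDS * SAMPLE_RATE
```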
Below are some examples that showcase the results obtained by our method and the baselines.
You can listen to selected samples generated from audio and text prompts by expanding the corresponding sections.
We use the Greatest Hits dataset, a widely adopted dataset for V2A tasks.
This dataset includes videos of humans using a drumstick to hit or rub objects or surfaces.
Example
We also use CLAP for semantic conditioning, which means we can use text as an additional modality at inference time.
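Such CLAP embeddings can be obtained, for example, with the open-source LAION-CLAP package; the checkpoint, prompts, and file name below are illustrative assumptions, not necessarily the exact setup used in the paper.

```python
import laion_clap

# Load a pretrained CLAP model (downloads the default LAION checkpoint).
model = laion_clap.CLAP_Module(enable_fusion=False)
model.load_ckpt()

# Text prompts and reference audio map into the same embedding space,
# so either can serve as the semantic condition for Stable-Foley.
text_emb = model.get_text_embedding(
    ["hitting a metal surface with a drumstick", "footsteps on gravel"],
    use_tensor=True,
)
audio_emb = model.get_audio_embedding_from_filelist(
    x=["reference_hit.wav"], use_tensor=True   # hypothetical reference file
)
```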
Example
In this work, we introduce a new dataset named Walking The Maps (WTM).
WTM includes short videos taken from YouTube gameplay sessions of four video games: Hogwarts Legacy, Zelda Breath of the Wild, Assassin's Creed: Odyssey, and Assassin's Creed IV Black Flag.
From each video, we extracted only the clips in which the footsteps are clearly audible and no other potential sound sources in the scene could be related to the target sound.
Example
Results for Stable-V2A and comparison with other SOTA models on Greatest Hits. The table shows whether each model generates output conditioned on an audio or a text prompt; HRC stands for Human-Readable Control and refers to the use of time-varying, interpretable signals (i.e., envelopes or onsets) that sound designers can use to control the generation process.
Our model achieves the best results, even in the text-conditioned setting.
Model | Audio | Text | HRC | FAD-P ↓ | FAD-C ↓ | FAD-LC ↓ | E-L1 ↓ | CLAP ↑ | FAVD ↓ |
---|---|---|---|---|---|---|---|---|---|
SpecVQGAN | ❌ | ❌ | ❌ | 99.07 | 1001 | 0.7102 | 0.0427 | 0.1418 | 6.5136 |
Diff-Foley | ❌ | ❌ | ❌ | 85.70 | 654 | 0.4690 | 0.0448 | 0.3733 | 4.6186 |
CondFoleyGen | ✅ | ❌ | ❌ | 74.93 | 650 | 0.4883 | 0.0357 | 0.4879 | 6.4814 |
SyncFusion | ❌ | ✅ | ✅ | 35.64 | 591 | 0.4365 | 0.0231 | 0.5154 | 4.3020 |
SyncFusion | ✅ | ❌ | ✅ | 27.85 | 542 | 0.2793 | 0.0177 | 0.6621 | 3.2825 |
Video-Foley | ❌ | ✅ | ✅ | 67.04 | 644 | 0.4997 | 0.0242 | 0.3680 | 4.9106 |
Video-Foley | ✅ | ❌ | ✅ | 28.45 | 435 | 0.1671 | 0.0183 | 0.6779 | 2.2070 |
Stable-V2A (Ours) | ❌ | ✅ | ✅ | 32.80 | 381 | 0.2516 | 0.0137 | 0.4806 | 3.9413 |
Stable-V2A (Ours) | ✅ | ❌ | ✅ | 16.57 | 217 | 0.1048 | 0.0137 | 0.6833 | 2.0264 |
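The E-L1 column above can be read as the mean absolute (L1) difference between the RMS envelope of the generated audio and that of the ground-truth track, so lower values indicate tighter temporal alignment. A minimal sketch under that assumption, with illustrative frame and hop sizes:

```python
import numpy as np
import librosa

def envelope_l1(generated: np.ndarray, reference: np.ndarray,
                frame_len: int = 1024, hop: int = 256) -> float:
    """Mean absolute difference between the frame-wise RMS envelopes of two waveforms."""
    env_gen = librosa.feature.rms(y=generated, frame_length=frame_len, hop_length=hop)[0]
    env_ref = librosa.feature.rms(y=reference, frame_length=frame_len, hop_length=hop)[0]
    n = min(len(env_gen), len(env_ref))   # compare over the common length
    return float(np.mean(np.abs(env_gen[:n] - env_ref[:n])))
```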
If you found this work useful, please cite us as follows:
@article{gramaccioni2024stable,
title={Stable-V2A: Synthesis of Synchronized Sound Effects with Temporal and Semantic Controls},
author={Riccardo Fosco Gramaccioni and Christian Marinoni and Emilian Postolache and Marco Comunità and Luca Cosmo and Joshua D. Reiss and Danilo Comminiello},
journal={arXiv preprint arXiv:2412.15023},
year={2024}
}