Riccardo F. Gramaccioni, Christian Marinoni, Emilian Postolache, Marco Comunità, Luca Cosmo, Joshua D. Reiss, and Danilo Comminiello
Sapienza University of Rome, Italy
Queen Mary University of London, UK
Ca' Foscari University of Venice, Italy
Sound designers and Foley artists usually sonorize a scene, such as one from a movie or video game, by manually annotating each action of interest in the video and producing a matching sound for it.
Our intent is to leave full creative control to sound designers, giving them a tool that bypasses the more repetitive parts of their work so they can focus on the creative aspects of sound production.
We introduce a new two-stage model for Video-to-Audio (V2A) generation, Stable-V2A, where the two stages are represented by RMS-Mapper and Stable-Foley.
RMS-Mapper is a simple but effective network that estimates an RMS envelope directly from a video, taking representative frame and optical-flow features as input.
Stable-Foley, the model for synthesizing sound effects, relies on Stable Audio, a state-of-the-art latent diffusion model for audio generation, to obtain the final output. In order to take advantage of the prior knowledge of this model and guide its generation process via the RMS envelope, we leverage a ControlNet, which is a network used to control the diffusion process via extra conditioning.
We train and test our model on Greatest Hits, a dataset commonly used to evaluate V2A models. In addition, to test our model on a case study of interest, we introduce Walking The Maps, a dataset of videos extracted from video games depicting animated characters walking in different locations.
Stable-V2A generates audio that is semantically and temporally aligned with the input videos.
It all starts from a silent video, which represents the input to our model.
In Stage 1, RMS-Mapper generates an envelope from the frame and optical-flow embeddings passed as input (extracted via TC-CLIP).
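For reference, the RMS envelope predicted in Stage 1 is a frame-wise measure of signal energy. Below is a minimal sketch of how such an envelope can be computed from a waveform; the window and hop sizes are illustrative assumptions, not the exact values used in the paper.

```python
import numpy as np

def rms_envelope(audio: np.ndarray, frame_len: int = 1024, hop: int = 256) -> np.ndarray:
    """Frame-wise RMS of a mono waveform: the square root of the mean squared
    amplitude inside each analysis window."""
    n_frames = 1 + max(len(audio) - frame_len, 0) // hop
    env = np.empty(n_frames)
    for i in range(n_frames):
        frame = audio[i * hop : i * hop + frame_len]
        env[i] = np.sqrt(np.mean(frame ** 2))
    return env
```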
In Stage 2, Stable-Foley generates the output audio. Here, the semantics are specified by cross-conditioning the latent diffusion model with a latent representation of an audio sample or a text prompt. Moreover, to control the process with the RMS envelope estimated in Stage 1, we use a ControlNet.
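The ControlNet idea can be pictured with a small, self-contained PyTorch sketch. This is illustrative only, not the paper's implementation: it shows the general recipe of a trainable copy of a frozen backbone block that consumes the control signal (here, the RMS envelope projected to the latent shape) and feeds its output back through a zero-initialized projection, so the backbone behaves exactly as before at the start of training.

```python
import copy

import torch
import torch.nn as nn


class ToyControlBlock(nn.Module):
    """Toy ControlNet-style conditioning: frozen backbone block + trainable copy
    that sees the control signal, merged through a zero-initialized projection."""

    def __init__(self, backbone_block: nn.Module, channels: int):
        super().__init__()
        self.frozen = backbone_block
        for p in self.frozen.parameters():
            p.requires_grad = False          # keep the pretrained backbone frozen
        self.trainable_copy = copy.deepcopy(backbone_block)
        for p in self.trainable_copy.parameters():
            p.requires_grad = True           # only the control branch is trained
        self.zero_proj = nn.Conv1d(channels, channels, kernel_size=1)
        nn.init.zeros_(self.zero_proj.weight)
        nn.init.zeros_(self.zero_proj.bias)

    def forward(self, x: torch.Tensor, control: torch.Tensor) -> torch.Tensor:
        # x and control share the shape (batch, channels, time).
        return self.frozen(x) + self.zero_proj(self.trainable_copy(x + control))


# Toy usage: a 1-D conv block stands in for a block of the audio backbone, and a
# random tensor stands in for the RMS envelope projected to the latent shape.
block = nn.Conv1d(64, 64, kernel_size=3, padding=1)
layer = ToyControlBlock(block, channels=64)
x, control = torch.randn(2, 64, 128), torch.randn(2, 64, 128)
y = layer(x, control)   # equals block(x) at initialization, since zero_proj outputs zeros
```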
For the silent video above, we generated 2-second chunks and concatenated them to obtain the full track; a rough sketch of this chunking step is shown below, followed by the resulting audio:
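In this sketch, `generate_chunk` is a hypothetical stand-in for the Stage 1 + Stage 2 pipeline and the sample rate is an assumption.

```python
import numpy as np

SAMPLE_RATE = 44100   # assumed output sample rate
CHUNK_SECONDS = 2     # chunk length used for the demo video above

def render_full_track(video_chunks, generate_chunk):
    """Run the V2A pipeline on each video chunk and concatenate the audio results."""
    return np.concatenate([generate_chunk(chunk) for chunk in video_chunks])

# Toy usage with a silent stand-in generator:
track = render_full_track(range(5), lambda _: np.zeros(CHUNK_SECONDS * SAMPLE_RATE))
assert track.shape[0] == 5 * CHUNK_SECONDS * SAMPLE_RATE
```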
Below are some examples that showcase the results obtained by our method and the baselines.
You can listen to selected samples generated from audio and text prompts by expanding the corresponding sections.
We use the Greatest Hits dataset, a widely adopted dataset for V2A tasks.
This dataset includes videos of humans using a drumstick to hit or rub objects or surfaces.
Example
We also use CLAP for semantic conditioning, which means we can use text as an additional modality at inference time.
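Such CLAP embeddings can be obtained, for example, with the open-source LAION-CLAP package; the checkpoint, prompts, and file name below are illustrative assumptions, not necessarily the exact setup used in the paper.

```python
import laion_clap

# Load a pretrained CLAP model (downloads the default LAION checkpoint).
model = laion_clap.CLAP_Module(enable_fusion=False)
model.load_ckpt()

# Text prompts and reference audio map into the same embedding space,
# so either can serve as the semantic condition for Stable-Foley.
text_emb = model.get_text_embedding(
    ["hitting a metal surface with a drumstick", "footsteps on gravel"],
    use_tensor=True,
)
audio_emb = model.get_audio_embedding_from_filelist(
    x=["reference_hit.wav"], use_tensor=True   # hypothetical reference file
)
```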
Example
In this work, we introduce a new dataset named Walking The Maps (WTM).
WTM includes short videos taken from YouTube gameplay sessions of four video games: Hogwarts Legacy, Zelda Breath of the Wild, Assassin's Creed: Odyssey, and Assassin's Creed IV Black Flag.
From each video, we extracted only the clips in which the footsteps are clearly audible and no other potential sound sources in the scene could be related to the target sound.
Example
Results for Stable-V2A and comparison with other SOTA models on Greatest Hits. The table shows whether each model generates output conditioned on an audio or a text prompt; HRC stands for Human-Readable Control and refers to the use of time-varying, interpretable signals (i.e., envelopes or onsets) that sound designers can use to control the generation process.
Our model achieves the best results, even in the text-conditioned setting.
Model | Audio | Text | HRC | FAD-P ↓ | FAD-C ↓ | FAD-LC ↓ | E-L1 ↓ | CLAP ↑ | FAVD ↓ |
---|---|---|---|---|---|---|---|---|---|
SpecVQGAN | ❌ | ❌ | ❌ | 99.07 | 1001 | 0.7102 | 0.0427 | 0.1418 | 6.5136 |
Diff-Foley | ❌ | ❌ | ❌ | 85.70 | 654 | 0.4690 | 0.0448 | 0.3733 | 4.6186 |
CondFoleyGen | ✅ | ❌ | ❌ | 74.93 | 650 | 0.4883 | 0.0357 | 0.4879 | 6.4814 |
SyncFusion | ❌ | ✅ | ✅ | 35.64 | 591 | 0.4365 | 0.0231 | 0.5154 | 4.3020 |
SyncFusion | ✅ | ❌ | ✅ | 27.85 | 542 | 0.2793 | 0.0177 | 0.6621 | 3.2825 |
Video-Foley | ❌ | ✅ | ✅ | 67.04 | 644 | 0.4997 | 0.0242 | 0.3680 | 4.9106 |
Video-Foley | ✅ | ❌ | ✅ | 28.45 | 435 | 0.1671 | 0.0183 | 0.6779 | 2.2070 |
Stable-V2A (Ours) | ❌ | ✅ | ✅ | 32.80 | 381 | 0.2516 | 0.0137 | 0.4806 | 3.9413 |
Stable-V2A (Ours) | ✅ | ❌ | ✅ | 16.57 | 217 | 0.1048 | 0.0137 | 0.6833 | 2.0264 |
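The E-L1 column above can be read as the mean absolute (L1) difference between the RMS envelope of the generated audio and that of the ground-truth track, so lower values indicate tighter temporal alignment. A minimal sketch under that assumption, with illustrative frame and hop sizes:

```python
import numpy as np
import librosa

def envelope_l1(generated: np.ndarray, reference: np.ndarray,
                frame_len: int = 1024, hop: int = 256) -> float:
    """Mean absolute difference between the frame-wise RMS envelopes of two waveforms."""
    env_gen = librosa.feature.rms(y=generated, frame_length=frame_len, hop_length=hop)[0]
    env_ref = librosa.feature.rms(y=reference, frame_length=frame_len, hop_length=hop)[0]
    n = min(len(env_gen), len(env_ref))   # compare over the common length
    return float(np.mean(np.abs(env_gen[:n] - env_ref[:n])))
```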
If you found this work useful, please cite us as follows:
@article{gramaccioni2024stable,
title={Stable-V2A: Synthesis of Synchronized Sound Effects with Temporal and Semantic Controls},
author={Riccardo Fosco Gramaccioni and Christian Marinoni and Emilian Postolache and Marco Comunità and Luca Cosmo and Joshua D. Reiss and Danilo Comminiello},
journal={arXiv preprint arXiv:2412.15023},
year={2024}
}