Long-Term Rhythmic Video Soundtracker
Jiashuo Yu
Yaohui Wang
Xinyuan Chen
Xiao Sun
Yu Qiao
Shanghai Artificial Intelligence Laboratory
International Conference on Machine Learning (ICML) 2023

[arXiv] [Code] [Dataset]




"A cute loris, flat design, anime style." -- Stable Diffusion XL



Abstract

We consider the problem of generating musical soundtracks in sync with rhythmic visual cues. Most existing works rely on pre-defined music representations, which limits generative flexibility and complexity. Other methods that directly generate video-conditioned waveforms are restricted to limited scenarios and short lengths, and suffer from unstable generation quality. To this end, we present Long-Term Rhythmic Video Soundtracker (LORIS), a novel framework to synthesize long-term conditional waveforms. Specifically, our framework consists of a latent conditional diffusion probabilistic model that performs waveform synthesis. Furthermore, we propose a series of context-aware conditioning encoders that take temporal information into account for long-term generation. Notably, we extend our model's applicability from dances to multiple sports scenarios such as floor exercise and figure skating. To perform comprehensive evaluations, we establish a benchmark for rhythmic video soundtracks, including a pre-processed dataset, improved evaluation metrics, and robust generative baselines. Extensive experiments show that our model generates long-term soundtracks with state-of-the-art musical quality and rhythmic correspondence.



Generated Examples

Cherry-picked 25-second and 50-second soundtracks for dance, floor exercise, and figure skating videos.
Dancing (25s)

Floor Exercise (25s, 50s)

Figure Skating (25s, 50s)



Methodology


The pipeline of our proposed method.
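
As a rough illustration of the diffusion component mentioned in the abstract, the sketch below shows the forward noising step and the noise-prediction loss of a standard denoising diffusion probabilistic model (DDPM). It is a generic textbook formulation, not the LORIS implementation: the linear noise schedule, the function names, and the list-based stand-in for a latent are all assumptions made for illustration. In a conditional model, a network would predict the noise from the noisy latent, the timestep, and the visual and rhythmic conditioning; that network is omitted here.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (a common DDPM choice, assumed here)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t ~ N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I).

    x0 is a flat list of floats standing in for a latent (hypothetical).
    Returns the noisy sample and the Gaussian noise that was added.
    """
    signal = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [signal * x + sigma * e for x, e in zip(x0, noise)]
    return xt, noise


def denoising_loss(predicted_noise, true_noise):
    """Mean squared error between predicted and true noise (the simple DDPM objective)."""
    squared = [(p - e) ** 2 for p, e in zip(predicted_noise, true_noise)]
    return sum(squared) / len(true_noise)


if __name__ == "__main__":
    rng = random.Random(0)
    alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
    latent = [0.5, -0.25, 1.0, 0.0]  # toy stand-in for an encoded audio latent
    noisy, eps = q_sample(latent, 500, alpha_bars, rng)
    # A conditional denoiser would predict eps from (noisy, t, condition);
    # a perfect prediction gives zero loss.
    print(denoising_loss(eps, eps))
```

Training such a model amounts to sampling a timestep, noising the latent with `q_sample`, and minimizing `denoising_loss` between the network's prediction and the noise that was added.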



Dataset


Comparison between our dataset and existing video soundtrack datasets. The statistics show that our dataset contains more video-music pairs, covers more categories, and has longer clips.




Citation

@inproceedings{Yu2023Long,
title={Long-Term Rhythmic Video Soundtracker},
author={Yu, Jiashuo and Wang, Yaohui and Chen, Xinyuan and Sun, Xiao and Qiao, Yu},
booktitle={International Conference on Machine Learning (ICML)},
year={2023}
}
				



Acknowledgements

This work is partially supported by the Shanghai Committee of Science and Technology (Grant No. 21DZ1100100) and by the National Natural Science Foundation of China (Grant No. 62102150). Thanks to Peng Wu for sharing the amazing template.



Contact

For further questions and suggestions, please contact Jiashuo Yu (yujiashuo@pjlab.org.cn).