FoleyGAN: Visually Guided Generative Adversarial Network-Based Synchronous Sound Generation in Silent Videos
arXiv - CS - Sound. Pub Date: 2021-07-20, DOI: arxiv-2107.09262
Sanchita Ghose, John J. Prevost

Deep-learning-based visual-to-sound generation systems must be designed with particular attention to the temporal synchronicity of visual and audio features. In this research we introduce the novel task of guiding a class-conditioned generative adversarial network with the temporal visual information of an input video, so that visual-to-sound generation respects the synchronicity between the audio and visual modalities. Our proposed FoleyGAN model conditions on the action sequences of visual events to generate realistic, visually aligned sound tracks. We expand our previously proposed Automatic Foley dataset to train FoleyGAN, and we evaluate the synthesized sound through a human survey, which reports notable audio-visual synchronicity (81% on average). Our approach also outperforms baseline models in statistical experiments across audio-visual datasets.
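To make the setup the abstract describes more concrete, here is a minimal PyTorch sketch of a class-conditioned generator guided by temporal visual features. This is not the authors' FoleyGAN architecture; every module, dimension, and name below is an illustrative assumption.

# A minimal sketch (not the authors' code) of a class-conditioned generator
# guided by temporal visual features, as the abstract describes. All names,
# dimensions, and architectural choices are assumptions for illustration.
import torch
import torch.nn as nn

class ConditionedSoundGenerator(nn.Module):
    def __init__(self, num_classes=12, visual_dim=512, noise_dim=128,
                 hidden_dim=256, spec_bins=128, spec_frames=64):
        super().__init__()
        # Embed the action class so the GAN is class-conditioned.
        self.class_embed = nn.Embedding(num_classes, hidden_dim)
        # Summarize per-frame visual features over time (the synchronicity cue).
        self.temporal = nn.GRU(visual_dim, hidden_dim, batch_first=True)
        # Map [noise | visual summary | class] to a spectrogram-like output.
        self.decode = nn.Sequential(
            nn.Linear(noise_dim + 2 * hidden_dim, 1024),
            nn.ReLU(inplace=True),
            nn.Linear(1024, spec_bins * spec_frames),
            nn.Tanh(),
        )
        self.spec_shape = (spec_bins, spec_frames)

    def forward(self, noise, class_ids, visual_seq):
        # visual_seq: (batch, time, visual_dim) per-frame visual features.
        _, h = self.temporal(visual_seq)          # h: (1, batch, hidden_dim)
        cond = torch.cat([h[-1], self.class_embed(class_ids)], dim=1)
        out = self.decode(torch.cat([noise, cond], dim=1))
        return out.view(-1, 1, *self.spec_shape)  # fake spectrogram

# Hypothetical usage: 8 silent clips, 30 frames of visual features each.
gen = ConditionedSoundGenerator()
z = torch.randn(8, 128)
labels = torch.randint(0, 12, (8,))
frames = torch.randn(8, 30, 512)
fake_spec = gen(z, labels, frames)               # (8, 1, 128, 64)

The GRU summary here stands in for whatever temporal visual encoding the paper uses; the point it illustrates is that the generator's conditioning vector carries both the action class and the time-ordered visual evidence, which is what lets an adversarially trained model align the generated audio with the video.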

Updated: 2021-07-21