FoleyGAN: Visually Guided Generative Adversarial Network-Based Synchronous Sound Generation in Silent Videos
arXiv - CS - Multimedia Pub Date : 2021-07-20 , DOI: arxiv-2107.09262 Sanchita Ghose, John J. Prevost
Deep learning based visual-to-sound generation systems must be designed with particular attention to the temporal synchronicity of visual and audio features. In this research we introduce the novel task of guiding a class-conditioned generative adversarial network with the temporal visual information of a video input, adapting the synchronicity traits between audio-visual modalities for visual-to-sound generation. Our proposed FoleyGAN model conditions on the action sequences of visual events to generate visually aligned, realistic soundtracks. We expand our previously proposed Automatic Foley dataset to train FoleyGAN, and we evaluate the synthesized sound through a human survey that shows noteworthy audio-visual synchronicity performance (81% on average). Our approach also outperforms other baseline models and audio-visual datasets in statistical experiments.
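The abstract describes conditioning a GAN on both a sound-event class label and temporal visual features from the video. The paper does not give the architecture here, so the following is only a minimal NumPy sketch of that conditioning idea: all dimensions, names, and the single linear "generator" layer are hypothetical stand-ins, not FoleyGAN's actual design.

```python
import numpy as np

# Hypothetical dimensions -- assumptions for illustration, not from the paper.
NUM_CLASSES = 10      # sound-event classes
VISUAL_DIM = 128      # pooled temporal visual feature size
NOISE_DIM = 64        # latent noise size
SPEC_BINS = 80        # output spectrogram frequency bins
SPEC_FRAMES = 32      # output spectrogram time frames

rng = np.random.default_rng(0)

def make_condition(class_id, visual_features):
    """Build the generator's conditioning vector: a one-hot class
    label concatenated with temporal visual features."""
    one_hot = np.zeros(NUM_CLASSES)
    one_hot[class_id] = 1.0
    return np.concatenate([one_hot, visual_features])

def generate(condition, noise, weights):
    """Stand-in 'generator': one linear map from [condition ; noise]
    to a flattened spectrogram, squashed with tanh (GAN-style)."""
    z = np.concatenate([condition, noise])
    spec = np.tanh(weights @ z)
    return spec.reshape(SPEC_BINS, SPEC_FRAMES)

in_dim = NUM_CLASSES + VISUAL_DIM + NOISE_DIM
weights = rng.standard_normal((SPEC_BINS * SPEC_FRAMES, in_dim)) * 0.01

visual = rng.standard_normal(VISUAL_DIM)   # e.g. pooled per-frame CNN features
cond = make_condition(class_id=3, visual_features=visual)
spec = generate(cond, rng.standard_normal(NOISE_DIM), weights)
print(spec.shape)  # (80, 32)
```

The key point the sketch illustrates is that the generator's input couples the class condition with video-derived temporal features, so the synthesized audio can be aligned with the visual events rather than generated from noise alone.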
Updated: 2021-07-21