Generating Visually Aligned Sound from Videos.
IEEE Transactions on Image Processing (IF 10.8). Pub Date: 2020-07-28. DOI: 10.1109/tip.2020.3009820
Peihao Chen , Yang Zhang , Mingkui Tan , Hongdong Xiao , Deng Huang , Chuang Gan

We focus on the task of generating sound from natural videos, where the sound should be both temporally and content-wise aligned with the visual signals. This task is extremely challenging because some sounds are produced outside the camera's view and cannot be inferred from the video content; a model may therefore be forced to learn an incorrect mapping between visual content and these irrelevant sounds. To address this challenge, we propose a framework named RegNet. In this framework, we first extract appearance and motion features from video frames to better distinguish the sound-emitting object from complex background information. We then introduce an innovative audio forwarding regularizer that takes the real sound as input and outputs bottlenecked sound features. Using both the visual and the bottlenecked sound features during training provides stronger supervision for sound prediction. The audio forwarding regularizer can control the irrelevant sound component and thus prevent the model from learning an incorrect mapping between video frames and sound emitted by off-screen objects. During testing, the audio forwarding regularizer is removed to ensure that RegNet produces aligned sound purely from visual features. Extensive evaluations based on Amazon Mechanical Turk demonstrate that our method significantly improves both temporal and content-wise alignment. Remarkably, our generated sound can fool human raters with a 68.12% success rate. Code and pre-trained models are publicly available at https://github.com/PeihaoChen/regnet.
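To make the train/test asymmetry concrete, below is a minimal, hypothetical PyTorch-style sketch of the idea. All module choices, feature dimensions, and the bottleneck width are illustrative assumptions, not the authors' actual architecture; see the linked repository for their implementation.

import torch
import torch.nn as nn

class RegNetSketch(nn.Module):
    # A minimal sketch of the RegNet idea; all names and sizes are assumptions.
    def __init__(self, visual_dim=2048, audio_dim=80, hidden=512, bottleneck=16):
        super().__init__()
        self.bottleneck = bottleneck
        # Encodes per-frame appearance + motion features into a temporal code.
        self.visual_encoder = nn.GRU(visual_dim, hidden, batch_first=True)
        # Audio forwarding regularizer: a deliberately tiny bottleneck, so it
        # can only carry the sound components the video cannot explain.
        self.audio_regularizer = nn.Linear(audio_dim, bottleneck)
        # Predicts a sound representation (e.g., mel-spectrogram frames).
        self.sound_decoder = nn.Linear(hidden + bottleneck, audio_dim)

    def forward(self, visual_feats, real_audio=None):
        # visual_feats: (batch, time, visual_dim); real_audio: (batch, time, audio_dim)
        h, _ = self.visual_encoder(visual_feats)
        if real_audio is not None:
            # Training: forward a bottlenecked encoding of the ground-truth sound.
            b = torch.relu(self.audio_regularizer(real_audio))
        else:
            # Testing: the regularizer is removed; zeros preserve the input shape,
            # so the prediction depends on visual features alone.
            b = h.new_zeros(h.size(0), h.size(1), self.bottleneck)
        return self.sound_decoder(torch.cat([h, b], dim=-1))

model = RegNetSketch()
vis = torch.randn(2, 100, 2048)   # features for 100 video frames
aud = torch.randn(2, 100, 80)     # aligned ground-truth spectrogram frames
loss = nn.functional.mse_loss(model(vis, aud), aud)  # training pass
pred = model(vis)                                    # test-time pass, video only

The bottleneck width is the key knob here: as the abstract describes, it limits how much of the real sound can pass through during training, so the visual branch is not forced to explain off-screen sounds.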

Updated: 2020-08-19