Exploiting Audio-Visual Consistency with Partial Supervision for Spatial Audio Generation
arXiv - CS - Multimedia. Pub Date: 2021-05-03. arXiv: 2105.00708
Yan-Bo Lin, Yu-Chiang Frank Wang

Humans perceive a rich auditory experience through the distinct sounds heard by each ear. Videos recorded with binaural audio in particular simulate how humans receive ambient sound. However, a large number of videos come with monaural audio only, which degrades the user experience due to the lack of ambient spatial information. To address this issue, we propose an audio spatialization framework that converts a monaural video into a binaural one by exploiting the relationship between its audio and visual components. By preserving left-right consistency in both the audio and visual modalities, our learning strategy can be viewed as a self-supervised learning technique, and it alleviates the dependency on large amounts of video data with ground-truth binaural audio during training. Experiments on benchmark datasets confirm the effectiveness of our proposed framework in both semi-supervised and fully supervised scenarios, with ablation studies and visualizations further supporting the use of our model for audio spatialization.
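The abstract only names the left-right consistency objective, so the following is a minimal sketch of one plausible reading of that idea, not the authors' implementation: mirroring the visual frames horizontally should swap the predicted left and right audio channels. The model interface, tensor shapes, and loss choice here are all assumptions for illustration.

```python
# Hypothetical sketch of a left-right consistency loss (not the paper's code).
# Assumes `model(frames, mono_audio)` returns predicted binaural audio of
# shape (B, 2, samples) from video frames (B, C, T, H, W) and mono audio
# (B, samples).
import torch
import torch.nn.functional as F

def left_right_consistency_loss(model, frames, mono_audio):
    # Predict left/right channels from the original view.
    pred = model(frames, mono_audio)                 # (B, 2, samples)

    # Mirror the scene by flipping frames along the width axis.
    flipped_frames = torch.flip(frames, dims=[-1])
    pred_flipped = model(flipped_frames, mono_audio)

    # A mirrored scene should produce swapped channels, so compare the
    # flipped-view prediction with the channel-swapped original prediction.
    swapped = torch.flip(pred, dims=[1])             # swap L and R
    return F.l1_loss(pred_flipped, swapped)
```

Such a term can be added to a supervised reconstruction loss on the subset of videos that do have ground-truth binaural audio, which matches the semi-supervised setting the abstract describes.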

Updated: 2021-05-04