DMMAN: A two-stage audio–visual fusion framework for sound separation and event localization
Neural Networks (IF 7.8), Pub Date: 2020-11-11, DOI: 10.1016/j.neunet.2020.10.003
Ruihan Hu, Songbing Zhou, Zhi Ri Tang, Sheng Chang, Qijun Huang, Yisen Liu, Wei Han, Edmond Q. Wu

Videos are a widely used medium through which people perceive physical changes in the world. However, the audio track of a video typically mixes the sounds of multiple objects, making it difficult to distinguish and localize each sound as a separate entity. To address this problem, this paper proposes the Deep Multi-Modal Attention Network (DMMAN), a model for unconstrained video datasets that performs both sound source separation and event localization. Built on a multi-modal separator module and a multi-modal matching classifier module, the model tackles the sound separation and modal synchronization problems through a two-stage fusion of audio and visual features. To link the two modules, regression and classification losses are combined into the DMMAN's loss function. The estimated spectrum masks and attention synchronization scores computed by the DMMAN generalize readily to the sound source separation and event localization tasks. Quantitative experiments show that the DMMAN not only separates sound sources with high quality, as measured by the Signal-to-Distortion Ratio and Signal-to-Interference Ratio metrics, but also handles mixed sound scenes whose sources were never heard together during training. The DMMAN also achieves better classification accuracy than competing baselines on the event localization tasks.
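The abstract does not give implementation details, but the two-stage design it describes can be illustrated with a minimal, hypothetical PyTorch sketch. Everything below is an assumption for illustration: the module names, layer sizes, feature dimensions, and the `alpha` loss weight are invented, not the paper's actual architecture. The sketch only shows the structural idea: stage one fuses audio and visual features to regress a spectrum mask, stage two fuses the separated audio with the visual features again to classify the event, and a joint regression-plus-classification loss links the two modules.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DMMANSketch(nn.Module):
    """Hypothetical sketch of the two-stage fusion described in the abstract.

    Stage 1 (multi-modal separator): fuse audio and visual features and
    predict a spectrum mask per sound source (regression target).
    Stage 2 (multi-modal matching classifier): fuse the masked audio with
    the visual features again and score the event class (classification
    target). All layer sizes here are assumptions.
    """

    def __init__(self, audio_dim=512, visual_dim=512, n_classes=10):
        super().__init__()
        # Stage 1: separator maps fused features to a mask in [0, 1].
        self.separator = nn.Sequential(
            nn.Linear(audio_dim + visual_dim, 512), nn.ReLU(),
            nn.Linear(512, audio_dim), nn.Sigmoid(),
        )
        # Stage 2: matching classifier scores the separated source.
        self.matcher = nn.Sequential(
            nn.Linear(audio_dim + visual_dim, 256), nn.ReLU(),
            nn.Linear(256, n_classes),
        )

    def forward(self, audio_feat, visual_feat):
        fused = torch.cat([audio_feat, visual_feat], dim=-1)
        mask = self.separator(fused)               # estimated spectrum mask
        separated = mask * audio_feat              # apply mask to the mixture
        refused = torch.cat([separated, visual_feat], dim=-1)
        logits = self.matcher(refused)             # event-class scores
        return mask, logits


def dmman_loss(mask, mask_target, logits, labels, alpha=1.0):
    """Joint loss linking the two modules, as the abstract describes:
    a regression term on the spectrum mask plus a classification term
    on the event label. The weighting `alpha` is an assumption."""
    reg = F.mse_loss(mask, mask_target)
    cls = F.cross_entropy(logits, labels)
    return reg + alpha * cls
```

The key design point the abstract emphasizes is that the two modules are trained jointly through this combined loss rather than in isolation, so the separator's masks are shaped by how well the downstream classifier can match the separated audio to the visual stream.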



Updated: 2020-11-22