Multi-modal Multi-channel Target Speech Separation
IEEE Journal of Selected Topics in Signal Processing (IF 7.5), Pub Date: 2020-03-01, DOI: 10.1109/jstsp.2020.2980956
Rongzhi Gu, Shi-Xiong Zhang, Yong Xu, Lianwu Chen, Yuexian Zou, Dong Yu

Target speech separation refers to extracting a target speaker's voice from the overlapped audio of simultaneous talkers. Previously, the use of the visual modality for target speech separation has demonstrated great potential. This work proposes a general multi-modal framework for target speech separation that utilizes all the available information about the target speaker, including his/her spatial location, voice characteristics, and lip movements. Under this framework, we also investigate fusion methods for multi-modal joint modeling. A factorized attention-based fusion method is proposed to aggregate the high-level semantic information of multiple modalities at the embedding level. This method first factorizes the mixture audio into a set of acoustic subspaces, then leverages the target speaker's information from the other modalities to enhance these subspace acoustic embeddings with a learnable attention scheme. To validate the robustness of the proposed multi-modal separation model in practical scenarios, the system is evaluated under conditions in which one of the modalities is temporarily missing, invalid, or corrupted. Experiments are conducted on a large-scale audio-visual dataset collected from YouTube (to be released), spatialized with simulated room impulse responses (RIRs). Experimental results show that the proposed multi-modal framework significantly outperforms single-modal and bi-modal speech separation approaches, while still supporting real-time processing.
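The factorized attention-based fusion described in the abstract can be sketched roughly as follows. This is a minimal illustrative PyTorch module, not the paper's implementation: the class name, layer shapes, number of subspaces, and the assumption that the directional, voice, and lip cues have already been combined into a single conditioning vector are all placeholders for illustration.

```python
import torch
import torch.nn as nn

class FactorizedAttentionFusion(nn.Module):
    """Hypothetical sketch: factorize the mixture-audio embedding into K
    acoustic subspaces, then use target-speaker cues from other modalities
    to compute learnable attention weights that aggregate the subspaces."""

    def __init__(self, audio_dim=256, cue_dim=256, num_subspaces=8, sub_dim=64):
        super().__init__()
        self.num_subspaces = num_subspaces
        # Project the mixture embedding into K subspace embeddings.
        self.subspace_proj = nn.Linear(audio_dim, num_subspaces * sub_dim)
        # Attention over subspaces, conditioned on the target cues
        # (e.g. concatenated spatial, voice, and lip embeddings).
        self.attn = nn.Linear(cue_dim, num_subspaces)
        self.out = nn.Linear(sub_dim, audio_dim)

    def forward(self, mix_emb, target_cues):
        # mix_emb: (batch, frames, audio_dim); target_cues: (batch, frames, cue_dim)
        B, T, _ = mix_emb.shape
        subspaces = self.subspace_proj(mix_emb).view(B, T, self.num_subspaces, -1)
        weights = torch.softmax(self.attn(target_cues), dim=-1)   # (B, T, K)
        fused = (weights.unsqueeze(-1) * subspaces).sum(dim=2)    # (B, T, sub_dim)
        return self.out(fused)                                    # enhanced embedding

# Example usage with dummy tensors
fusion = FactorizedAttentionFusion()
mix = torch.randn(2, 100, 256)
cues = torch.randn(2, 100, 256)
enhanced = fusion(mix, cues)   # -> (2, 100, 256)
```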

Updated: 2020-03-01