Audio-Visual Speech Separation and Dereverberation with a Two-Stage Multimodal Network
IEEE Journal of Selected Topics in Signal Processing (IF 7.5). Pub Date: 2020-03-01. DOI: 10.1109/jstsp.2020.2987209. Authors: Ke Tan, Yong Xu, Shi-Xiong Zhang, Meng Yu, Dong Yu
Background noise, interfering speech and room reverberation frequently distort target speech in real listening environments. In this study, we address joint speech separation and dereverberation, which aims to separate target speech from background noise, interfering speech and room reverberation. To tackle this fundamentally difficult problem, we propose a novel multimodal network that exploits both audio and visual signals. The proposed network architecture adopts a two-stage strategy: a separation module attenuates background noise and interfering speech in the first stage, and a dereverberation module suppresses room reverberation in the second stage. The two modules are first trained separately and then integrated for joint training, which is based on a new multi-objective loss function. Our experimental results show that the proposed multimodal network yields consistently better objective intelligibility and perceptual quality than several one-stage and two-stage baselines. Our network achieves a 21.10% improvement in ESTOI and a 0.79 improvement in PESQ over the unprocessed mixtures. Moreover, our network architecture does not require prior knowledge of the number of speakers.
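The two-stage strategy described in the abstract can be sketched in a minimal form: a first-stage mask suppresses noise and interfering speech, a second-stage mask suppresses reverberation, and joint training combines a loss term per stage. This is a simplified numpy illustration, not the authors' implementation; the masking formulation, loss terms, and the weighting parameter `alpha` are assumptions made for clarity.

```python
import numpy as np

def separation_stage(noisy_mag, mask_sep):
    """Stage 1 (sketch): apply a separation mask to the noisy magnitude
    spectrogram to attenuate background noise and interfering speech."""
    return noisy_mag * mask_sep

def dereverberation_stage(sep_mag, mask_derev):
    """Stage 2 (sketch): apply a dereverberation mask to the stage-1
    output to suppress room reverberation."""
    return sep_mag * mask_derev

def multi_objective_loss(est_sep, target_reverb, est_final, target_anechoic,
                         alpha=0.5):
    """Joint-training loss (assumed form): a weighted sum of an MSE term
    for each stage. Stage 1 is scored against the reverberant target
    speech, stage 2 against the anechoic target; alpha balances them."""
    loss_sep = np.mean((est_sep - target_reverb) ** 2)
    loss_derev = np.mean((est_final - target_anechoic) ** 2)
    return alpha * loss_sep + (1.0 - alpha) * loss_derev

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy magnitude spectrogram: 64 frames x 161 frequency bins.
    noisy = np.abs(rng.standard_normal((64, 161)))
    mask1 = rng.uniform(0.0, 1.0, size=noisy.shape)  # stage-1 mask
    mask2 = rng.uniform(0.0, 1.0, size=noisy.shape)  # stage-2 mask

    stage1_out = separation_stage(noisy, mask1)
    stage2_out = dereverberation_stage(stage1_out, mask2)
    loss = multi_objective_loss(stage1_out, noisy, stage2_out, noisy)
    print(f"joint loss: {loss:.4f}")
```

In the paper the masks would be predicted by the audio-visual separation and dereverberation modules; here they are random placeholders so the data flow of the two stages and the combined loss can be seen end to end.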
Updated: 2020-03-01