MFFCN: Multi-layer Feature Fusion Convolution Network for Audio-visual Speech Enhancement
arXiv - CS - Sound. Pub Date: 2021-01-15, DOI: arxiv-2101.05975
Xinmeng Xu, Dongxiang Xu, Jie Jia, Yang Wang, Binbin Chen

The purpose of speech enhancement is to extract the target speech signal from a mixture of sounds generated by several sources. Speech enhancement can potentially benefit from visual information about the target speaker, such as lip movement and facial expressions, because the visual aspect of speech is essentially unaffected by the acoustic environment. In order to fuse audio and visual information, an audio-visual fusion strategy is proposed that goes beyond simple feature concatenation and learns to automatically align the two modalities, leading to a more powerful representation that increases intelligibility in noisy conditions. The proposed model fuses audio-visual features layer by layer and feeds these fused features to each corresponding decoding layer. Experimental results show relative improvements of 6% to 24% on test sets over the audio modality alone, depending on the audio noise level. Moreover, PESQ increases significantly from 1.21 to 2.06 in the -15 dB SNR experiment.
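
The abstract does not spell out the architecture, but the fusion idea it describes can be sketched. Below is a minimal PyTorch sketch of layer-by-layer audio-visual fusion feeding an encoder-decoder: the names (FusionBlock, MFFCNSketch), the sigmoid-gating alignment, the channel sizes, and the 1-D visual embedding input are all illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionBlock(nn.Module):
    """Fuse one encoder level's audio and visual features.

    Goes beyond plain concatenation: the visual stream is interpolated
    onto the audio time axis (a crude alignment), then gates the audio
    channels before a 1x1 projection. The paper's learned alignment is
    likely more elaborate; this gating is a stand-in.
    """
    def __init__(self, ch):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv1d(ch, ch, 1), nn.Sigmoid())
        self.proj = nn.Conv1d(2 * ch, ch, 1)

    def forward(self, a, v):
        v = F.interpolate(v, size=a.shape[-1], mode="linear", align_corners=False)
        return self.proj(torch.cat([a * self.gate(v), v], dim=1))

class MFFCNSketch(nn.Module):
    """Encoder-decoder in which the fused audio-visual features from
    every encoder level are fed to the corresponding decoder level,
    mirroring the layer-by-layer fusion described in the abstract."""
    def __init__(self, chs=(16, 32, 64)):
        super().__init__()
        self.a_enc, self.v_enc, self.fuse, self.dec = (nn.ModuleList() for _ in range(4))
        in_a = in_v = 1
        for ch in chs:
            self.a_enc.append(nn.Conv1d(in_a, ch, 5, stride=2, padding=2))  # downsample audio
            self.v_enc.append(nn.Conv1d(in_v, ch, 3, padding=1))            # keep visual rate
            self.fuse.append(FusionBlock(ch))
            in_a = in_v = ch
        dec_out = list(reversed(chs))[1:] + [1]  # e.g. (32, 16, 1)
        for ci, co in zip(reversed(chs), dec_out):
            # input is the decoder state concatenated with the fused skip feature
            self.dec.append(nn.ConvTranspose1d(2 * ci, co, 5, stride=2,
                                               padding=2, output_padding=1))

    def forward(self, audio, visual):
        fused, a, v = [], audio, visual
        for ae, ve, fu in zip(self.a_enc, self.v_enc, self.fuse):
            a, v = F.relu(ae(a)), F.relu(ve(v))
            fused.append(fu(a, v))  # one fused audio-visual feature per level
        x = fused[-1]
        for i, dec in enumerate(self.dec):
            x = dec(torch.cat([x, fused[-(i + 1)]], dim=1))
            if i < len(self.dec) - 1:
                x = F.relu(x)  # linear output at the final layer
        return x

# Toy usage: 64-sample waveform, 16-step 1-D visual embedding sequence.
audio = torch.randn(2, 1, 64)
visual = torch.randn(2, 1, 16)
print(MFFCNSketch()(audio, visual).shape)  # torch.Size([2, 1, 64])
```

The design choice this sketch illustrates is that each level's fused feature serves as the skip connection into the corresponding decoder layer, while the audio and visual encoder streams stay separate; that matches the abstract's "feeds these fused features to each corresponding decoding layer," even if the paper's exact blocks differ.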

Updated: 2021-01-18