Masked multi-head self-attention for causal speech enhancement
Speech Communication ( IF 3.2 ) Pub Date : 2020-10-29 , DOI: 10.1016/j.specom.2020.10.004
Aaron Nicolson , Kuldip K. Paliwal

Accurately modelling the long-term dependencies of noisy speech is critical to the performance of a speech enhancement system. Current deep learning approaches to speech enhancement employ either a recurrent neural network (RNN) or a temporal convolutional network (TCN). However, RNNs and TCNs both demonstrate deficiencies when modelling long-term dependencies. Enter multi-head attention (MHA), a mechanism that has outperformed both RNNs and TCNs in tasks such as machine translation. By using sequence similarity, MHA possesses the ability to more efficiently model long-term dependencies. Moreover, masking can be employed to ensure that the MHA mechanism remains causal, an attribute critical for real-time processing. Motivated by these points, we investigate a deep neural network (DNN) that utilises masked MHA for causal speech enhancement. The conditions used to evaluate the proposed DNN include real-world non-stationary and coloured noise sources at multiple SNR levels. Our extensive experimental investigation demonstrates that the proposed DNN can produce enhanced speech at a higher quality and intelligibility than both RNNs and TCNs. We conclude that deep learning approaches employing masked MHA are more suited for causal speech enhancement than RNNs and TCNs. Availability: MHANet is available at https://github.com/anicolson/DeepXi
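To make the masking idea concrete: in masked self-attention, each time frame may attend only to the current and past frames, which keeps the mechanism causal and thus usable in real time. The following is a minimal NumPy sketch of this; the random projection weights, dimensions, and function name are illustrative assumptions, not the trained MHANet from the paper.

```python
import numpy as np

def causal_mhsa(x, num_heads):
    """Illustrative masked multi-head self-attention over a
    (time, features) sequence. The upper-triangular mask stops
    each frame from attending to future frames. Weights are
    random stand-ins for learned parameters (a sketch, not MHANet)."""
    T, d = x.shape
    assert d % num_heads == 0
    d_h = d // num_heads
    rng = np.random.default_rng(0)
    # Random projections stand in for learned Q/K/V/output weights.
    Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) / np.sqrt(d)
                      for _ in range(4))
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    # Split features into heads: (heads, time, d_h).
    split = lambda M: M.reshape(T, num_heads, d_h).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_h)  # (heads, T, T)
    # Causal mask: frame t may attend only to frames <= t.
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores[:, future] = -np.inf
    # Row-wise softmax; masked entries become exactly zero weight.
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    out = w @ Vh                                        # (heads, T, d_h)
    return out.transpose(1, 0, 2).reshape(T, d) @ Wo
```

Because of the mask, perturbing a future frame leaves all earlier output frames unchanged, which is exactly the property a causal (real-time) enhancement front-end needs.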




Updated: 2020-11-09