Robust voice activity detection using an auditory-inspired masked modulation encoder based convolutional attention network
Speech Communication (IF 3.2), Pub Date: 2023-12-14, DOI: 10.1016/j.specom.2023.103024
Nan Li, Longbiao Wang, Meng Ge, Masashi Unoki, Sheng Li, Jianwu Dang

Deep learning has revolutionized voice activity detection (VAD) by offering promising solutions. However, feeding traditional features such as raw waveforms or Mel-frequency cepstral coefficients directly into deep neural networks often degrades VAD performance under noise interference. In contrast, humans can discern speech even in complex, noisy environments, which motivated us to draw inspiration from the human auditory system. We propose a robust VAD algorithm, the auditory-inspired masked modulation encoder based convolutional attention network (AMME-CANet), which integrates our AMME with a CANet. First, we design auditory-inspired modulation features as a deep-learning encoder (the auditory-inspired modulation encoder, AME), effectively simulating the transmission of sound signals to the inner-ear hair cells and the subsequent modulation filtering by neural cells. Second, building on the masking effects observed in the human auditory system, we extend the AME with a masking mechanism, yielding the AMME; the AMME amplifies cleaner speech frequencies while suppressing noise components. Third, inspired by the human auditory mechanism and capitalizing on contextual information, we leverage an attention mechanism for VAD, assigning higher weights to contextual frames that carry richer and more informative cues. Through extensive experimentation and evaluation, we demonstrate the superior performance of AMME-CANet under challenging noise conditions.
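The three-stage pipeline the abstract describes (auditory-style encoding, masking, and context attention) can be loosely sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the band-energy front end, the recursive low-pass "modulation filter", the sigmoid soft mask, and the dot-product context attention are all simplifying assumptions standing in for the learned AMME and CANet components.

```python
import numpy as np

def band_envelopes(x, frame=256, hop=128, n_bands=8):
    """Crude cochlea-like front end: per-frame sub-band energies,
    a stand-in for the hair-cell output the AME simulates."""
    n_frames = 1 + (len(x) - frame) // hop
    win = np.hanning(frame)
    spec = np.stack([np.abs(np.fft.rfft(x[i * hop:i * hop + frame] * win))
                     for i in range(n_frames)])            # (T, frame//2 + 1)
    edges = np.linspace(0, spec.shape[1], n_bands + 1, dtype=int)
    return np.stack([spec[:, a:b].mean(axis=1)
                     for a, b in zip(edges[:-1], edges[1:])], axis=1)  # (T, n_bands)

def modulation_filter(env, alpha=0.8):
    """Toy stand-in for modulation filtering by neural cells:
    a first-order low-pass over each band's temporal envelope."""
    out = np.empty_like(env)
    out[0] = env[0]
    for t in range(1, len(env)):
        out[t] = alpha * out[t - 1] + (1 - alpha) * env[t]
    return out

def soft_mask(env, noise_floor):
    """Masking step (hypothetical form): a sigmoid gain per time-frequency
    cell that suppresses low-SNR cells far more than high-SNR ones."""
    snr = env / (noise_floor + 1e-8)
    return env / (1.0 + np.exp(-(snr - 1.0)))

def attention_pool(feats, context=5):
    """Dot-product attention over a sliding context window: frames whose
    features resemble the current frame receive higher weight."""
    pooled = np.empty_like(feats)
    for t in range(len(feats)):
        lo, hi = max(0, t - context), min(len(feats), t + context + 1)
        scores = feats[lo:hi] @ feats[t]
        w = np.exp(scores - scores.max())
        w /= w.sum()
        pooled[t] = w @ feats[lo:hi]
    return pooled
```

A VAD decision would then threshold the pooled per-frame energy; in the actual system all three stages are learned jointly rather than hand-crafted as here.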


Updated: 2023-12-18