Knowing What to Listen to: Early Attention for Deep Speech Representation Learning
arXiv - CS - Sound. Pub Date: 2020-09-03, DOI: arxiv-2009.01822
Amirhossein Hajavi, Ali Etemad

Deep learning techniques have considerably improved speech processing in recent years. Speech representations extracted by deep learning models are used in a wide range of tasks such as speech recognition, speaker recognition, and speech emotion recognition. Attention models play an important role in improving deep learning models. However, current attention mechanisms are unable to attend to fine-grained information items. In this paper, we propose the novel Fine-grained Early Frequency Attention (FEFA) for speech signals. This model is capable of focusing on information items as small as frequency bins. We evaluate the proposed model on two popular tasks: speaker recognition and speech emotion recognition. Two widely used public datasets, VoxCeleb and IEMOCAP, are used in our experiments. The model is implemented on top of several prominent deep models as backbone networks to evaluate its impact on performance compared to the original networks and other related work. Our experiments show that adding FEFA to different CNN architectures consistently improves performance by substantial margins, even setting a new state-of-the-art for the speaker recognition task. We also test our model against different levels of added noise, showing improved robustness and lower sensitivity compared to the backbone networks.
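The abstract gives no implementation details, but the core idea it describes, namely re-weighting individual frequency bins of the input spectrogram with learned attention scores before the backbone CNN sees it, can be sketched roughly as follows. All function and parameter names here are hypothetical illustrations; the actual FEFA formulation in the paper may differ.

```python
import numpy as np

def frequency_attention(spectrogram, w, b):
    """Hypothetical sketch of early frequency-bin attention.

    spectrogram: (freq_bins, time_frames) magnitude spectrogram
    w, b: illustrative learnable per-bin parameters, shape (freq_bins,)
    Returns the spectrogram with each frequency bin scaled by an
    attention weight, computed before any backbone network is applied.
    """
    # Score each frequency bin from its average energy over time
    energy = spectrogram.mean(axis=1)        # (freq_bins,)
    scores = w * energy + b                  # per-bin logits
    # Softmax over frequency bins -> attention weights
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Re-weight each bin; broadcasting over the time axis
    return spectrogram * weights[:, None]

rng = np.random.default_rng(0)
spec = rng.random((64, 100))    # 64 frequency bins, 100 time frames
w = np.ones(64)
b = np.zeros(64)
out = frequency_attention(spec, w, b)
print(out.shape)                # same shape as the input: (64, 100)
```

Because the attention acts early and per bin, the backbone CNN receives an input of unchanged shape in which uninformative frequency regions are already attenuated, which is consistent with the abstract's claim that FEFA can be added on top of different CNN architectures without altering them.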

Updated: 2020-09-04