A New Time–Frequency Attention Tensor Network for Language Identification
Circuits, Systems, and Signal Processing (IF 1.8) Pub Date: 2019-10-28, DOI: 10.1007/s00034-019-01286-9
Xiaoxiao Miao, Ian McLoughlin, Yonghong Yan

In this paper, we aim to improve traditional DNN x-vector language identification performance by employing a wide residual network (WRN) as a powerful feature extractor, which we combine with a novel frequency attention network. Compared with conventional time attention, our method learns discriminative weights for different frequency bands to generate weighted means and standard deviations for utterance-level classification. This mechanism enables the architecture to direct attention to important frequency bands rather than important time frames, as traditional time attention methods do. We then introduce a cross-layer frequency attention tensor network (CLF-ATN) that exploits information from different layers to recapture frame-level language characteristics dropped by aggressive frequency pooling in lower layers, effectively restoring fine-grained discriminative language detail. Finally, we explore the joint fusion of frame-level and frequency-band attention in a time–frequency attention network. Experimental results show that, first, the WRN can significantly outperform a traditional DNN x-vector implementation; second, the proposed frequency attention method is more effective than time attention; and third, frequency–time score fusion yields a further improvement. Extensive experiments on CLF-ATN also demonstrate that it improves discrimination by regaining the dropped fine-grained frequency information, particularly for low-dimensional frequency features.
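The core idea of attentive statistics pooling over frequency bands can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: the scoring network (a one-layer tanh scorer with illustrative parameters `W` and `v`), the per-band embedding shape, and all variable names are assumptions for exposition. The sketch scores each frequency band, softmax-normalizes the scores into attention weights, and produces an attention-weighted mean and standard deviation for utterance-level classification.

```python
import numpy as np

def frequency_attentive_stats(h, W, v):
    """Frequency-band attentive statistics pooling (illustrative sketch).

    h : (F, D) array of per-frequency-band embeddings
        (e.g. WRN feature maps pooled over time; shape is an assumption).
    W : (D, A) and v : (A,) parameters of a hypothetical one-layer scorer.
    Returns the concatenated weighted mean and std, shape (2*D,).
    """
    # One scalar attention score per frequency band.
    scores = np.tanh(h @ W) @ v                # (F,)
    # Numerically stable softmax over the frequency axis.
    e = np.exp(scores - scores.max())
    alpha = e / e.sum()                        # (F,) weights, sum to 1
    # Attention-weighted first- and second-order statistics.
    mean = alpha @ h                           # (D,)
    var = alpha @ (h - mean) ** 2              # (D,) weighted variance
    std = np.sqrt(var + 1e-9)                  # epsilon for stability
    return np.concatenate([mean, std])
```

With zero scorer parameters the weights collapse to uniform, recovering plain mean/std pooling; learned parameters instead emphasize the frequency bands most discriminative for language identity, which is the contrast with time attention described above.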

Updated: 2019-10-28