An Improved Deep Neural Network for Modeling Speaker Characteristics at Different Temporal Scales

arXiv - CS - Sound. Pub Date: 2020-01-14, DOI: arxiv-2001.04584
Bin Gu, Wu Guo

This paper presents an improved deep embedding learning method based on a convolutional neural network (CNN) for text-independent speaker verification. Two improvements are proposed for x-vector embedding learning: (1) Multi-scale convolution (MSCNN) is adopted in the frame-level layers to capture complementary speaker information in different receptive fields. (2) A Baum-Welch statistics attention (BWSA) mechanism is applied in the temporal pooling layer to integrate more useful long-term speaker characteristics. Experiments are carried out on the NIST SRE16 evaluation set. The results demonstrate the effectiveness of MSCNN and show that the proposed BWSA can further improve the performance of the DNN embedding system.
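
The two ideas in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The parallel kernel sizes (1, 3, 5), the channel counts, the 23-dimensional input features, and the single scoring vector `v` are all assumptions chosen for illustration. In particular, the attention here is a generic learned-vector attention over frames, swapped in because the abstract does not give the details of the Baum-Welch statistics attention.

```python
import numpy as np

def conv1d_same(x, w):
    """1-D convolution over time with zero padding.
    x: (T, C_in) frames, w: (K, C_in, C_out) with odd K. Returns (T, C_out)."""
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    # Stack the K time-shifted views of the input: (T, K, C_in)
    windows = np.stack([xp[i:i + x.shape[0]] for i in range(k)], axis=1)
    return np.einsum("tkc,kco->to", windows, w)

def multi_scale_conv(x, kernels):
    """Apply parallel convolutions with different kernel sizes (receptive
    fields), pass each through ReLU, and concatenate along channels."""
    return np.concatenate(
        [np.maximum(conv1d_same(x, w), 0.0) for w in kernels], axis=1
    )

def attentive_stats_pooling(h, v):
    """Attention-weighted mean and standard deviation over time.
    h: (T, D) frame-level outputs, v: (D,) hypothetical scoring vector.
    Returns a (2 * D,) utterance-level vector."""
    scores = h @ v
    scores -= scores.max()                      # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()
    mean = alpha @ h
    var = alpha @ (h - mean) ** 2
    return np.concatenate([mean, np.sqrt(var + 1e-8)])

rng = np.random.default_rng(0)
frames = rng.standard_normal((200, 23))         # 200 frames, 23-dim features (assumed)
kernels = [rng.standard_normal((k, 23, 16)) * 0.1 for k in (1, 3, 5)]
h = multi_scale_conv(frames, kernels)           # shape (200, 48)
pooled = attentive_stats_pooling(h, rng.standard_normal(h.shape[1]))  # shape (96,)
```

With a zero scoring vector the attention weights become uniform, so the pooling reduces to the plain mean and standard deviation used in standard statistics pooling.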


Updated: 2020-01-15
