Text-independent speaker recognition based on adaptive course learning loss and deep residual network
EURASIP Journal on Advances in Signal Processing (IF 1.7) Pub Date: 2021-07-23, DOI: 10.1186/s13634-021-00762-2
Qinghua Zhong 1,2, Ruining Dai 1, Han Zhang 1, Yongsheng Zhu 1, Guofu Zhou 2

Text-independent speaker recognition is widely used in identity recognition and has a broad spectrum of applications, such as criminal investigation, payment certification, and interest-based customer services. To improve the discriminative power of log filter bank feature vectors, this paper proposes a text-independent speaker recognition method based on a deep residual network model. The model is composed of a residual network (ResNet) and a convolutional attention statistics pooling (CASP) layer; the CASP layer aggregates the frame-level features produced by the ResNet into a single utterance-level feature. Extracting speaker-discriminative speech features with a deep residual network is a promising direction, and a straightforward approach is to train the feature extraction network with a margin-based loss function. However, margin-based loss functions have a notable limitation: the margins between different categories are the same and fixed. We therefore adopt an adaptive curriculum learning loss (ACLL) to address this problem and combine it with two margin-based losses, AM-Softmax and AAM-Softmax. The proposed method was trained on the large-scale VoxCeleb2 dataset and evaluated in extensive text-independent speaker recognition experiments, achieving an average equal error rate (EER) of 1.76% on the VoxCeleb1 test set, 1.91% on VoxCeleb1-E, and 3.24% on VoxCeleb1-H. Compared with related speaker recognition methods, EER improved by 1.11%, 1.04%, and 1.69% on the three test sets, respectively.
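The two building blocks named in the abstract, attention-based statistics pooling over frame-level features and an additive-margin softmax loss, can be sketched briefly. The PyTorch code below is a minimal illustration under our own naming, not the paper's implementation: the exact CASP layer and the ACLL modulation of hard samples are not specified in the abstract, and the module names (AttentiveStatsPooling, AMSoftmaxLoss) and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveStatsPooling(nn.Module):
    """Aggregate frame-level features (B, C, T) into one utterance-level
    vector by concatenating an attention-weighted mean and standard
    deviation over the time axis (a common CASP-style pooling)."""
    def __init__(self, channels: int, bottleneck: int = 128):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Conv1d(channels, bottleneck, kernel_size=1),
            nn.Tanh(),
            nn.Conv1d(bottleneck, channels, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.attention(x), dim=2)   # per-frame weights
        mu = torch.sum(w * x, dim=2)                  # weighted mean
        var = torch.sum(w * x * x, dim=2) - mu * mu   # weighted variance
        sigma = torch.sqrt(var.clamp(min=1e-6))       # weighted std
        return torch.cat([mu, sigma], dim=1)          # (B, 2C)

class AMSoftmaxLoss(nn.Module):
    """Additive-margin softmax on L2-normalised embeddings and weights:
    logits are s*(cos(theta_y) - m) for the target class and s*cos(theta)
    for the others; note the fixed margin m that ACLL aims to relax."""
    def __init__(self, embed_dim: int, n_classes: int,
                 scale: float = 30.0, margin: float = 0.2):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_classes, embed_dim))
        nn.init.xavier_uniform_(self.weight)
        self.scale, self.margin = scale, margin

    def forward(self, emb: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        cos = F.linear(F.normalize(emb), F.normalize(self.weight))
        onehot = F.one_hot(labels, cos.size(1)).to(cos.dtype)
        logits = self.scale * (cos - self.margin * onehot)
        return F.cross_entropy(logits, labels)
```

In a full pipeline, the pooled (B, 2C) statistics vector would typically pass through a linear layer to form the speaker embedding fed to the loss. AAM-Softmax differs only in placing the margin inside the angle, s·cos(θ_y + m), and the paper's ACLL further adapts the margin for hard (misclassified) samples as training progresses, which the fixed-margin sketch above does not reproduce.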



Updated: 2021-07-23