Introducing phonetic information to speaker embedding for speaker verification
EURASIP Journal on Audio, Speech, and Music Processing (IF 2.4), Pub Date: 2019-12-01, DOI: 10.1186/s13636-019-0166-8
Yi Liu, Liang He, Jia Liu, Michael T. Johnson

Phonetic information is one of the most essential components of a speech signal and plays an important role in many speech processing tasks. However, it is difficult to integrate phonetic information into speaker verification systems, since phonetic content varies at the frame level while speaker characteristics typically reside at the segment level. In deep neural network-based speaker verification, existing methods apply phonetic information only to frame-wise trained speaker embeddings. To address this weakness, this paper proposes phonetic adaptation and hybrid multi-task learning, and further combines them into c-vector and simplified c-vector architectures. Experiments on the National Institute of Standards and Technology (NIST) 2010 speaker recognition evaluation (SRE) show that the four proposed speaker embeddings achieve better performance than the baseline. The c-vector system performs best, providing over 30% and 15% relative improvements in equal error rate (EER) for the core-extended and 10 s–10 s conditions, respectively. On the NIST SRE 2016, 2018, and VoxCeleb datasets, the proposed c-vector approach improves performance even when there is a language mismatch within the training sets or between the training and evaluation sets. Extensive experimental results demonstrate the effectiveness and robustness of the proposed methods.
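The relative improvements above are reported in terms of equal error rate (EER), the standard speaker-verification metric: the operating point at which the false acceptance rate equals the false rejection rate. As a minimal illustrative sketch (not the authors' evaluation code), EER can be estimated from trial scores like this, assuming higher scores indicate same-speaker trials:

```python
import numpy as np

def compute_eer(target_scores, nontarget_scores):
    """Estimate the equal error rate from verification trial scores.

    target_scores: scores of same-speaker (target) trials.
    nontarget_scores: scores of different-speaker (impostor) trials.
    Returns the rate where false rejection ~= false acceptance.
    """
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones(len(target_scores)),
                             np.zeros(len(nontarget_scores))])
    order = np.argsort(scores)          # sweep threshold from low to high
    labels = labels[order]

    n_target = labels.sum()
    n_nontarget = len(labels) - n_target
    # At each candidate threshold, targets scoring below it are falsely
    # rejected; non-targets scoring at or above it are falsely accepted.
    frr = np.cumsum(labels) / n_target                # false rejection rate
    far = 1.0 - np.cumsum(1 - labels) / n_nontarget   # false acceptance rate

    idx = np.argmin(np.abs(frr - far))  # closest crossing point
    return (frr[idx] + far[idx]) / 2.0

# Perfectly separated scores give an EER of 0.
print(compute_eer(np.array([2.0, 3.0, 4.0]), np.array([0.0, 1.0])))
```

A 30% relative improvement means, for example, a baseline EER of 2.0% dropping to below 1.4% under the proposed c-vector system.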

Updated: 2019-12-01