Integrating a joint Bayesian generative model in a discriminative learning framework for speaker verification
arXiv - CS - Sound, Pub Date: 2021-01-09, DOI: arxiv-2101.03329
Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai

The task of speaker verification (SV) is to decide whether an utterance was spoken by a target speaker or an impostor. In most SV studies, a log-likelihood ratio (LLR) score is estimated from a generative probability model of speaker features and compared with a threshold to make the decision. However, a generative model usually focuses on feature distributions and lacks the ability to select discriminative features, so it is easily distracted by nuisance features. SV, as a hypothesis test, can be formulated as a binary classification task to which neural network (NN) based discriminative learning can be applied. With discriminative learning, nuisance features can be removed with the help of label supervision. However, discriminative learning pays more attention to classification boundaries, which makes it prone to overfitting the training data and generalizing poorly to test data. In this paper, we propose a hybrid learning framework that integrates a joint Bayesian (JB) generative model into a neural discriminative learning framework for SV. A Siamese NN is built with dense layers to approximate the mapping functions used in the SV pipeline with the JB model, and the LLR score estimated from the JB model is connected to the distance metric in pairwise discriminative learning. After initializing the Siamese NN with the parameters learned from the JB model, we further train the model parameters on pairwise samples as a binary discrimination task. Moreover, a direct evaluation metric in SV, the minimum empirical Bayes risk, is designed and integrated as an objective function in the discriminative learning. We carried out SV experiments on the Speakers in the Wild (SITW) and VoxCeleb corpora. Experimental results showed that our proposed model improved performance by a large margin compared with state-of-the-art models for SV.
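
As a rough illustration of the LLR-versus-threshold decision described in the abstract, the sketch below scores a pair of speaker features under a textbook joint Bayesian assumption: each mean-centred feature is x = mu + eps, with speaker identity mu ~ N(0, S_mu) and within-speaker variation eps ~ N(0, S_eps). The function names (`jb_llr`, `verify`), the toy covariances, and the zero threshold are all hypothetical choices made for this example. It is not the authors' implementation and does not include the Siamese network or the Bayes-risk objective.

```python
import numpy as np


def gaussian_logpdf_zero_mean(z, cov):
    """Log density of a zero-mean multivariate Gaussian evaluated at z."""
    d = z.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    quad = z @ np.linalg.solve(cov, z)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + quad)


def jb_llr(x1, x2, s_mu, s_eps):
    """LLR of 'same speaker' against 'different speakers' for one pair.

    Assumption (not taken from the paper): x = mu + eps with
    mu ~ N(0, s_mu) and eps ~ N(0, s_eps), features mean-centred.
    """
    total = s_mu + s_eps
    zeros = np.zeros_like(s_mu)
    # Same speaker: the two features share mu, so they covary through s_mu.
    cov_same = np.block([[total, s_mu], [s_mu, total]])
    # Different speakers: independent identities, no cross-covariance.
    cov_diff = np.block([[total, zeros], [zeros, total]])
    z = np.concatenate([x1, x2])
    return gaussian_logpdf_zero_mean(z, cov_same) - gaussian_logpdf_zero_mean(z, cov_diff)


def verify(x1, x2, s_mu, s_eps, threshold=0.0):
    """Accept the pair as the same speaker when the LLR exceeds the threshold."""
    return jb_llr(x1, x2, s_mu, s_eps) > threshold


# Toy covariances: between-speaker variance dominates within-speaker variance.
s_mu = 4.0 * np.eye(2)
s_eps = np.eye(2)
x = np.array([1.0, -2.0])

same_pair_score = jb_llr(x, x, s_mu, s_eps)       # identical features
opposite_pair_score = jb_llr(x, -x, s_mu, s_eps)  # opposite features
```

With these toy values the identical pair receives a positive score and the opposite pair a negative one, so a threshold of zero separates them. In practice the covariances would be estimated from training data and the threshold tuned on a development set.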

Updated: 2021-01-12