Self-segmentation of pass-phrase utterances for deep feature learning in text-dependent speaker verification,Computer Speech & Language

Self-segmentation of pass-phrase utterances for deep feature learning in text-dependent speaker verification
Computer Speech & Language ( IF 3.1 ) Pub Date : 2021-04-22 , DOI: 10.1016/j.csl.2021.101229
Achintya Kumar Sarkar , Zheng-Hua Tan

In this paper, we propose a novel method to segment and label pass-phrase utterances for training deep neural network (DNN) bottleneck (BN) features for text-dependent speaker verification (TD-SV). Specifically, gender-dependent hidden Markov models (HMMs) for monophones are first trained using pass-phrase utterances that are disjoint from the evaluation data. Next, the trained HMMs are speaker-adapted and then used to segment and label these training utterances at the phone level. The resulting labeled data is subsequently used to train DNN models that discriminate gender-dependent phones, from which phone-discriminant BN features are extracted. This is in contrast to conventional approaches, which apply a general-purpose, speaker-independent automatic speech recognition (ASR) system to generate the segmentation and labels. The proposed method eliminates the need for a separate ASR system, which can have the additional disadvantage of being mismatched with the pass-phrase utterances in terms of language, dialect, domain, acoustic conditions and so on. Experiments are conducted on the RedDots challenge 2016 database for TD-SV using short utterances, with Gaussian mixture model-universal background model and i-vector techniques. Experimental results demonstrate that the proposed method yields lower error rates in TD-SV than a set of existing methods. A thorough ablation study further confirms the effectiveness of the method. Fusion at both the score and feature levels also shows the complementary nature of the proposed features.
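To make the labeling step more concrete, the following is a minimal sketch of how phone-level segments might be expanded into per-frame, gender-dependent phone targets for DNN training. It is not the authors' implementation: the segment boundaries are hand-written stand-ins for the output of the speaker-adapted HMM alignment, and the 10 ms frame shift, the `gender_phone` label format, and the names `frame_labels` and `build_label_map` are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# (phone, start_sec, end_sec); hypothetical stand-in for an HMM alignment
Segment = Tuple[str, float, float]


def frame_labels(segments: List[Segment], gender: str,
                 frame_shift: float = 0.01) -> List[str]:
    """Expand phone-level segments into one gender-tagged label per frame.

    frame_shift is an assumed 10 ms hop; the abstract does not state it.
    """
    labels: List[str] = []
    for phone, start, end in segments:
        first = round(start / frame_shift)  # index of first frame in segment
        last = round(end / frame_shift)     # index one past the last frame
        labels.extend([f"{gender}_{phone}"] * (last - first))
    return labels


def build_label_map(all_labels: List[str]) -> Dict[str, int]:
    """Map each distinct gender-dependent phone label to an integer class."""
    return {lab: i for i, lab in enumerate(sorted(set(all_labels)))}


# Toy alignment for one utterance from a female speaker (made-up values).
segments = [("sil", 0.00, 0.03), ("m", 0.03, 0.05), ("ay", 0.05, 0.09)]
labels = frame_labels(segments, gender="f")
label_map = build_label_map(labels)
targets = [label_map[lab] for lab in labels]  # integer targets for the DNN
```

Tagging each phone with the speaker's gender doubles the number of output classes, which is one simple way to represent the gender-dependent phone targets that the abstract describes.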

Updated: 2021-04-30
