HiLAM-state discriminative multi-task deep neural network in dynamic time warping framework for text-dependent speaker verification
Speech Communication (IF 2.4) Pub Date: 2020-05-06, DOI: 10.1016/j.specom.2020.03.007
Mohammad Azharuddin Laskar, Rabul Hussain Laskar

This paper builds on a multi-task Deep Neural Network (DNN), which provides an utterance-level feature representation called the j-vector, to implement a Text-dependent Speaker Verification (TDSV) system. This technique exploits the speaker idiosyncrasies associated with individual pass-phrases. However, speaker information is known to be characteristic of more specific speech units; if it is instead treated as a coarse entity spread uniformly across the whole pass-phrase, important speaker identity traits are likely to get averaged out. This work attempts to overcome this limitation and devises a technique to leverage these finer speaker traits. It proposes to align the training data for the multi-task DNN using the Hierarchical Multi-Layer Acoustic Model (HiLAM). HiLAM is an HMM-based text-dependent model that defines refined segments of a pass-phrase using Gaussian Mixture Model (GMM) states. This helps to exploit the speaker idiosyncrasies associated with finer and more specific segments of speech. Also, as HiLAM is built for the particular text in question, this alignment technique automatically accounts for the exact context of the speech units in the concerned pass-phrase. The proposed technique has been found to improve the performance of the system significantly, and integrating Dynamic Time Warping (DTW) with it leads to further improvement. Experiments have been conducted on Part 1 of the RSR2015, RedDots, and NITS-TD databases. The best-performing proposed system achieves a relative Equal Error Rate (EER) reduction of up to 50.98% with respect to the baseline j-vector-based system for the overall test condition on the RSR2015 database.
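For readers unfamiliar with how DTW can be combined with segment-level embeddings and how a relative EER reduction is computed, the sketch below illustrates the general idea only. It is a minimal illustration under stated assumptions, not the authors' implementation: the cosine distance, function names, embedding dimension, and EER values are hypothetical and not taken from the paper.

```python
# Illustrative sketch only: generic DTW comparison of two sequences of
# segment-level embeddings (e.g., j-vector-like vectors extracted per
# HiLAM-aligned segment). Names and values are hypothetical.
import numpy as np

def cosine_distance(a, b):
    """Cosine distance between two embedding vectors."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def dtw_score(enroll_seq, test_seq):
    """Align two embedding sequences with DTW and return the
    path-normalized accumulated cosine distance (lower = more similar)."""
    n, m = len(enroll_seq), len(test_seq)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = cosine_distance(enroll_seq[i - 1], test_seq[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m] / (n + m)

# Example: two utterances as sequences of 64-dim segment embeddings.
rng = np.random.default_rng(0)
enroll = [rng.standard_normal(64) for _ in range(8)]
test = [rng.standard_normal(64) for _ in range(10)]
print("DTW distance:", dtw_score(enroll, test))

# Relative EER reduction, as used when comparing a proposed system to a baseline:
# relative_reduction = (EER_baseline - EER_proposed) / EER_baseline
eer_baseline, eer_proposed = 5.10, 2.50   # hypothetical values for illustration
print("Relative EER reduction: {:.2%}".format(
    (eer_baseline - eer_proposed) / eer_baseline))
```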




Updated: 2020-05-06