An investigation of domain adaptation in speaker embedding space for speaker recognition
Speech Communication ( IF 3.2 ) Pub Date : 2021-01-23 , DOI: 10.1016/j.specom.2021.01.001
Fahimeh Bahmaninezhad , Chunlei Zhang , John H.L. Hansen

Speaker recognition remains an active research challenge, with expanding applications in commercial, forensic, educational, and general speech technology interfaces. Difficulties persist, however, especially for naturalistic audio streams in which train and test data are mismatched (i.e., when training or system development data and enrollment/test or application data are collected from different sources). Mismatch conditions (Hansen and Hasan, 2015) fall into two categories: extrinsic (channel, noise, etc.) and intrinsic (duration, language, and speaker traits including stress, emotion, Lombard effect, vocal effort, and accent). Here, we investigate speaker recognition under the domain mismatch problem (intrinsic mismatch), focusing on the challenges introduced by the NIST (National Institute of Standards and Technology) SRE (speaker recognition evaluation) in 2016 and 2018. NIST SRE-16 and SRE-18 include language mismatch between training data (used for system development) and enrollment/test data (used in the application phase). We develop three alternative speaker embedding systems: i-vector, t-vector (an improved triplet loss solution), and x-vector. In addition, a number of unsupervised and supervised (using pseudo labels) methods are studied for domain mismatch compensation, applied mainly at the back-end level. These include adapted PLDA, adapted discriminant analysis, and score normalization and calibration methods using unlabeled in-domain data. We also propose new variations of discriminant analysis with support vectors (SVDA). The results confirm that SVDA measurably improves speaker recognition performance on the SRE-16 and SRE-18 tasks, by +15% and +8% respectively in terms of min-Cprimary and by +14% and +16% respectively in terms of EER, using i-vector speaker embeddings as the baseline. These advancements offer promising steps toward addressing speaker recognition in naturalistic audio streams.
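The abstract mentions score normalization with unlabeled in-domain data but does not say which variant is used. As a rough illustration only, the sketch below shows one common choice, symmetric score normalization (S-norm), applied to cosine scores. The cosine back-end, the function names, and the random embeddings are all assumptions made for this example; they are not taken from the paper, which also evaluates PLDA-based back-ends.

```python
import numpy as np

def cosine_scores(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def s_norm(enroll, test, cohort):
    """Symmetric score normalization (S-norm) for a single trial.

    enroll, test: 1-D speaker embeddings of dimension D.
    cohort: (N, D) array of unlabeled in-domain embeddings (hypothetical data).
    """
    raw = cosine_scores(enroll[None, :], test[None, :])[0, 0]
    # Scores of each side of the trial against the in-domain cohort.
    e_cohort = cosine_scores(enroll[None, :], cohort)[0]
    t_cohort = cosine_scores(test[None, :], cohort)[0]
    # Z-normalize the raw score with each side's cohort statistics, then average.
    z_enroll = (raw - e_cohort.mean()) / e_cohort.std()
    z_test = (raw - t_cohort.mean()) / t_cohort.std()
    return 0.5 * (z_enroll + z_test)

# Toy usage with random vectors standing in for real embeddings.
rng = np.random.default_rng(0)
cohort = rng.normal(size=(200, 64))
enroll = rng.normal(size=64)
test = rng.normal(size=64)
print(s_norm(enroll, test, cohort))
```

Because the cohort needs no speaker labels, this kind of normalization can draw on unlabeled data from the target domain, which is the setting the abstract describes.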




Updated: 2021-03-10