Dual-model self-regularization and fusion for domain adaptation of robust speaker verification
Speech Communication (IF 3.2), Pub Date: 2023-10-27, DOI: 10.1016/j.specom.2023.103001
Yibo Duan, Yanhua Long, Jiaen Liang

Learning robust representations of speaker identity is a key challenge in speaker verification, because such representations generalize well to real-world scenarios with domain or intra-speaker variations. In this study, we aim to improve the domain robustness of the well-established ECAPA-TDNN framework for low-resource cross-domain speaker verification tasks. First, we propose a novel dual-model self-learning approach to produce robust speaker identity embeddings: the ECAPA-TDNN is extended into a dual-model structure, which is then trained and regularized using self-supervised learning between different intermediate acoustic representations. Second, we enhance the dual-model by combining the self-supervised loss and the supervised loss in a time-dependent manner, which improves the model's overall generalization capability. Third, to better exploit the complementary information in the dual-model's outputs, we explore various methods for similarity computation and score fusion. Experiments on the publicly available VoxCeleb2 and VoxMovies datasets show that the proposed dual-model regularization and fusion methods outperform a strong baseline, with relative EER reductions of 9.07%–11.6% across various in-domain and cross-domain evaluation sets. Importantly, the approach is effective in both supervised and unsupervised scenarios for low-resource cross-domain speaker verification tasks.
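
The abstract does not give the exact weighting schedule or fusion rule, so the following is only a minimal sketch of the two ideas it names: a time-dependent combination of supervised and self-supervised losses, and fusion of the two models' similarity scores. The sigmoid-shaped ramp-up, the cosine scoring, the linear fusion weight `alpha`, and every function name here are illustrative assumptions, not the authors' method.

```python
import math


def rampup_weight(step, rampup_steps, max_weight=1.0):
    """Assumed schedule: sigmoid-shaped ramp-up of the self-supervised weight.

    The weight starts near 0 and reaches max_weight at rampup_steps.
    The paper's actual time-dependent schedule may differ.
    """
    if rampup_steps <= 0:
        return max_weight
    t = min(max(step / rampup_steps, 0.0), 1.0)
    return max_weight * math.exp(-5.0 * (1.0 - t) ** 2)


def combined_loss(supervised, self_supervised, step, rampup_steps):
    """Supervised loss plus a time-weighted self-supervised regularizer."""
    return supervised + rampup_weight(step, rampup_steps) * self_supervised


def cosine(a, b):
    """Cosine similarity between two embedding vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def fused_score(enroll_a, test_a, enroll_b, test_b, alpha=0.5):
    """Assumed fusion rule: weighted average of the two models' cosine scores.

    (enroll_a, test_a) are embeddings from the first model and
    (enroll_b, test_b) from the second; alpha is a hypothetical weight.
    """
    return alpha * cosine(enroll_a, test_a) + (1.0 - alpha) * cosine(enroll_b, test_b)
```

With this schedule the self-supervised term contributes little early in training and its full weight once `step` reaches `rampup_steps`; a verification decision would then compare `fused_score(...)` against a threshold tuned on development data.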




Updated: 2023-10-28