A speaker verification backend with robust performance across conditions
Computer Speech & Language (IF 4.3), Pub Date: 2021-07-01, DOI: 10.1016/j.csl.2021.101258
Luciana Ferrer, Mitchell McLaren, Niko Brümmer

In this paper, we address the problem of speaker verification in conditions unseen or unknown during development. A standard method for speaker verification consists of extracting speaker embeddings with a deep neural network and processing them through a backend composed of probabilistic linear discriminant analysis (PLDA) and global logistic regression score calibration. This method is known to result in systems that work poorly on conditions different from those used to train the calibration model. We propose to modify the standard backend, introducing an adaptive calibrator that uses duration and other automatically extracted side-information to adapt to the conditions of the inputs. The backend is trained discriminatively to optimize binary cross-entropy. When trained on a number of diverse datasets that are labeled only with respect to speaker, the proposed backend consistently and, in some cases, dramatically improves calibration, compared to the standard PLDA approach, on a number of held-out datasets, some of which are markedly different from the training data. Discrimination performance is also consistently improved. We show that joint training of the PLDA and the adaptive calibrator is essential — the same benefits cannot be achieved when freezing PLDA and fine-tuning the calibrator. To our knowledge, the results in this paper are the first evidence in the literature that it is possible to develop a speaker verification system with robust out-of-the-box performance on a large variety of conditions.
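To make the described pipeline concrete, the following is a minimal, hypothetical sketch (in PyTorch) of a jointly trained backend of the kind the abstract describes: a simple bilinear scorer stands in for PLDA, and a per-trial scale and offset computed from side-information (e.g., log duration) calibrate the raw score into a log-likelihood ratio, with the whole backend trained discriminatively on binary cross-entropy. The class and function names, tensor shapes, and the choice of a bilinear scorer are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class AdaptiveCalibrationBackend(nn.Module):
    """Toy scorer + condition-adaptive calibrator, trained jointly."""
    def __init__(self, emb_dim: int, side_dim: int):
        super().__init__()
        # Bilinear form on the embeddings stands in for PLDA scoring.
        self.W = nn.Parameter(torch.eye(emb_dim) * 0.01)
        # Side-information (e.g., durations) drives a per-trial scale/offset.
        self.scale = nn.Linear(side_dim, 1)
        self.offset = nn.Linear(side_dim, 1)

    def forward(self, enroll, test, side_info):
        # Raw verification score for each trial (batch of trial pairs).
        raw = torch.einsum("bi,ij,bj->b", enroll, self.W, test)
        # Condition-adaptive affine calibration to a log-likelihood ratio.
        alpha = nn.functional.softplus(self.scale(side_info)).squeeze(-1)
        beta = self.offset(side_info).squeeze(-1)
        return alpha * raw + beta

def train_step(model, optimizer, enroll, test, side_info, labels):
    """One discriminative update; labels are 1.0 (same speaker) or 0.0."""
    optimizer.zero_grad()
    llr = model(enroll, test, side_info)
    loss = nn.functional.binary_cross_entropy_with_logits(llr, labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch the scorer and the calibrator share one loss and one optimizer, which mirrors the abstract's point that joint training is what matters; freezing the scorer and fitting only the calibrator would correspond to optimizing `scale` and `offset` with `W` detached.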




Updated: 2021-07-04