Xi-Vector Embedding for Speaker Recognition, IEEE Signal Processing Letters



Xi-Vector Embedding for Speaker Recognition
IEEE Signal Processing Letters (IF 3.2), Pub Date: 2021-06-23, DOI: 10.1109/lsp.2021.3091932
Kong Aik Lee, Qiongqiong Wang, Takafumi Koshinaka

We present a Bayesian formulation for deep speaker embedding, wherein the xi-vector is the Bayesian counterpart of the x-vector that takes uncertainty estimates into account. On the technology front, we offer a simple and straightforward extension to the now widely used x-vector: an auxiliary neural network that predicts the frame-wise uncertainty of the input sequence. We show that the proposed extension leads to substantial improvement across all operating points, with a significant reduction in error rates and detection cost. On the theoretical front, our proposal integrates the Bayesian formulation of the linear Gaussian model into speaker-embedding neural networks via the pooling layer. In one sense, our proposal integrates the Bayesian formulation of the i-vector into that of the x-vector. Hence, we refer to the embedding as the xi-vector, pronounced /zai/ vector. Experimental results on the SITW evaluation set show a consistent improvement of over 17.5% in equal error rate (EER) and 10.9% in minimum detection cost.
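The two ingredients named in the abstract above, an auxiliary network that predicts frame-wise uncertainty and a pooling layer derived from the Bayesian linear Gaussian model, can be illustrated with a short sketch. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation. It assumes that each frame-level feature z_t is a noisy observation of the utterance-level embedding with a diagonal precision, that the embedding has a Gaussian prior, and that the auxiliary network outputs log-precisions. Under these assumptions the posterior precision is the prior precision plus the sum of the frame precisions, and the posterior mean is the precision-weighted average of the frame features and the prior mean. The function names `predict_log_precision` and `gaussian_posterior_pooling`, and the use of a single linear layer as the auxiliary network, are hypothetical.

```python
import numpy as np


def predict_log_precision(hidden, W, b):
    """Hypothetical auxiliary network: a single linear layer mapping
    frame-level hidden activations (T x H) to log-precisions (T x D)."""
    return hidden @ W + b


def gaussian_posterior_pooling(z, log_prec, prior_mean, prior_log_prec):
    """Pool frame features z (T x D) into a Gaussian posterior (sketch).

    Assumed model: z_t = h + e_t with e_t ~ N(0, diag(prec_t)^-1),
    and prior h ~ N(prior_mean, diag(prior_prec)^-1).
    Returns the posterior mean and diagonal posterior precision, both (D,).
    """
    prec = np.exp(log_prec)              # frame-wise precisions, (T, D)
    prior_prec = np.exp(prior_log_prec)  # prior precision, (D,)
    # Precisions add: prior precision plus the sum over all frames.
    post_prec = prior_prec + prec.sum(axis=0)
    # Precision-weighted combination of the prior mean and the frames.
    post_mean = (prior_prec * prior_mean + (prec * z).sum(axis=0)) / post_prec
    return post_mean, post_prec


# Usage with random stand-ins for real frame-level activations.
rng = np.random.default_rng(0)
T, H, D = 50, 16, 8
hidden = rng.standard_normal((T, H))
z = rng.standard_normal((T, D))
W = 0.1 * rng.standard_normal((H, D))
b = np.zeros(D)
mean, prec = gaussian_posterior_pooling(
    z, predict_log_precision(hidden, W, b), np.zeros(D), np.zeros(D)
)
print(mean.shape, prec.shape)  # (8,) (8,)
```

When every frame has the same precision and the prior precision is negligible, the posterior mean reduces to the plain average of the frame features, which is the mean statistic used in x-vector statistics pooling. The predicted precisions are what let the layer down-weight frames the auxiliary network considers unreliable. Which posterior statistics feed the subsequent layers is not specified in the abstract and is left open here.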


Updated: 2021-06-23
