Estimating Uniqueness of I-Vector-Based Representation of Human Voice
IEEE Transactions on Information Forensics and Security (IF 6.3). Pub Date: 2021-04-14. DOI: 10.1109/TIFS.2021.3071574
Sinan E. Tandogan, Husrev Taha Sencar

We study the individuality of the human voice with respect to a widely used feature representation of speech utterances, namely, the i-vector model. As a first step toward this goal, we compare and contrast uniqueness measures proposed for different biometric modalities. Then, we introduce a new uniqueness measure that evaluates the entropy of i-vectors while taking into account speaker-level variations. Our measure operates in the discrete feature space and relies on accurate estimation of the distribution of i-vectors. Therefore, i-vectors are quantized while ensuring that the quantized and original representations yield similar speaker verification performance. Uniqueness estimates are obtained from two newly generated datasets and the public VoxCeleb dataset. The first custom dataset contains more than one and a half million speech samples of 20,741 speakers obtained from TEDx Talks videos. The second includes over twenty-one thousand speech samples from 1,595 actors, extracted from movie dialogues. Using this data, we analyzed how several factors, namely the number of speakers, the number of samples per speaker, sample duration, and the diversity of utterances, affect uniqueness estimates. Most notably, we determine that the discretization of i-vectors does not cause a reduction in speaker recognition performance. Our results show that the degree of distinctiveness offered by the i-vector-based representation may reach 43-70 bits for 5-second-long speech samples; however, under less constrained variations in speech, uniqueness estimates are found to drop by around 30 bits. We also find that doubling the sample duration increases the distinctiveness of the i-vector representation by around 20 bits.
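The abstract's core idea, quantizing continuous i-vectors and then measuring the entropy of the resulting discrete codes in bits, can be sketched as follows. This is only an illustration of the general approach, not the paper's method: the specific quantizer (per-dimension uniform binning), the bin count, and the plug-in entropy estimator are all assumptions, and real i-vectors would come from a trained extractor rather than random data.

```python
import numpy as np

def quantize(ivectors, n_bins=4):
    """Uniformly quantize each i-vector dimension into n_bins levels.

    The paper's actual quantization scheme is not specified in the
    abstract; per-dimension uniform binning is used here purely for
    illustration.
    """
    lo, hi = ivectors.min(axis=0), ivectors.max(axis=0)
    # Scale each dimension to [0, 1), guarding against zero range.
    scaled = (ivectors - lo) / np.where(hi > lo, hi - lo, 1.0)
    # Map to integer bin indices in [0, n_bins - 1].
    return np.minimum((scaled * n_bins).astype(int), n_bins - 1)

def empirical_entropy_bits(codes):
    """Plug-in (empirical) entropy, in bits, of the discrete codes."""
    _, counts = np.unique(codes, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# Toy example: 1000 random 10-dimensional "i-vectors".
rng = np.random.default_rng(0)
iv = rng.normal(size=(1000, 10))
codes = quantize(iv, n_bins=4)
print(f"empirical entropy: {empirical_entropy_bits(codes):.2f} bits")
```

Note that the plug-in estimate is capped at log2(number of samples), which is why the paper's estimates require very large speaker populations and careful distribution estimation rather than a naive count of observed codes.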

Updated: 2021-04-14