When Automatic Voice Disguise Meets Automatic Speaker Verification
IEEE Transactions on Information Forensics and Security (IF 6.8), Pub Date: 2020-09-16, DOI: 10.1109/tifs.2020.3023818
Linlin Zheng, Jiakang Li, Meng Sun, Xiongwei Zhang, Thomas Fang Zheng

The technique of transforming voices in order to hide the real identity of a speaker is called voice disguise. Automatic voice disguise (AVD), which modifies the spectral and temporal characteristics of voices with various algorithms, is easily carried out with software accessible to the public. AVD has posed a great threat to both human listening and automatic speaker verification (ASV). In this paper, we find that ASV is not only a victim of AVD but can also serve as a tool to defeat some simple types of AVD. First, three types of AVD, pitch scaling, vocal tract length normalization (VTLN), and voice conversion (VC), are introduced as representative methods. State-of-the-art ASV methods are then used to objectively evaluate the impact of AVD on ASV in terms of equal error rate (EER). Moreover, an approach is proposed to restore a disguised voice to its original version by minimizing a function of ASV scores with respect to the restoration parameters. Experiments are conducted on disguised voices from VoxCeleb, a dataset recorded in real-world noisy scenarios. The results show that, for voice disguise by pitch scaling, the proposed approach obtains an EER of around 7%, compared with the 30% EER of a recently proposed baseline that uses the ratio of fundamental frequencies. The proposed approach also generalizes well to restoring the nonlinear frequency warping of VTLN, reducing its EER from 34.3% to 18.5%. However, it is difficult to restore the source speakers in VC with our approach; more complex forms of restoration functions or other paralinguistic cues might be necessary to invert the nonlinear transform in VC. Finally, contrastive visualization of ASV features with and without restoration illustrates the role of the proposed approach in an intuitive way.
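To make the restoration idea concrete, below is a minimal Python sketch for the pitch-scaling case: search over candidate inverse pitch shifts and keep the one whose restored voice best matches the claimed speaker under an ASV score. This is an illustration only, not the paper's released code; the paper formulates restoration as minimizing a function of ASV scores with respect to restoration parameters, whereas the sketch simplifies this to a grid search. The names `asv_embed`, `cosine`, `restore_pitch_disguise`, and the semitone grid are assumptions introduced here, and the toy MFCC-mean "embedding" merely stands in for a real pretrained speaker-embedding model.

```python
# Hedged sketch: undo a pitch-scaling disguise by grid-searching the inverse
# shift that maximizes ASV similarity to the claimed speaker's enrollment.
import numpy as np
import librosa


def asv_embed(waveform: np.ndarray, sr: int) -> np.ndarray:
    # Toy stand-in embedding (mean MFCCs). A real experiment would use a
    # pretrained ASV model (e.g., x-vector or ECAPA embeddings) instead.
    mfcc = librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=24)
    return mfcc.mean(axis=1)


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity used here as a simple ASV score.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def restore_pitch_disguise(disguised: np.ndarray, sr: int,
                           enroll_emb: np.ndarray,
                           semitone_grid=np.arange(-8.0, 8.5, 0.5)):
    # Try each candidate inverse shift and keep the best-scoring restoration.
    best_shift, best_score = 0.0, -np.inf
    for n_steps in semitone_grid:
        candidate = librosa.effects.pitch_shift(disguised, sr=sr, n_steps=n_steps)
        score = cosine(asv_embed(candidate, sr), enroll_emb)
        if score > best_score:
            best_shift, best_score = float(n_steps), score
    restored = librosa.effects.pitch_shift(disguised, sr=sr, n_steps=best_shift)
    return restored, best_shift, best_score
```

In practice one would replace the grid search with a proper optimization over the restoration parameter and use calibrated ASV scores; the sketch only shows why a stronger ASV model directly yields a better restoration signal for simple disguises such as pitch scaling, while the nonlinear mappings of VC are not invertible by a single scalar parameter.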

Updated: 2020-10-06