Using Deep Learning Techniques and Inferential Speech Statistics for AI Synthesised Speech Recognition
arXiv - CS - Multimedia Pub Date : 2021-07-23 , DOI: arxiv-2107.11412
Arun Kumar Singh (Indian Institute of Technology Jammu), Priyanka Singh (Dhirubhai Ambani Institute of Information and Communication Technology), Karan Nathwani (Indian Institute of Technology Jammu)

Recent developments in technology have rewarded us with impressive audio synthesis models such as Tacotron and WaveNet. On the other hand, they pose serious threats, such as voice clones and deep fakes, that may go undetected. To tackle these alarming situations, there is an urgent need for models that can discriminate synthesized speech from actual human speech and also identify the source of such a synthesis. Here, we propose a model based on a Convolutional Neural Network (CNN) and a Bidirectional Recurrent Neural Network (BiRNN) that achieves both of these objectives. The temporal dependencies present in AI-synthesized speech are exploited using the bidirectional RNN together with the CNN. The model outperforms state-of-the-art approaches, classifying AI-synthesized audio versus real human speech with an error rate of 1.9% and detecting the underlying synthesis architecture with an accuracy of 97%.
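The abstract above combines a CNN front end with a bidirectional RNN and reports two outputs: a real-vs-synthetic decision and an identification of the underlying synthesis architecture. A minimal sketch of such a two-headed CNN + BiRNN classifier, assuming log-mel spectrogram input, is shown below; the layer sizes, the GRU choice, and the class counts are illustrative assumptions, not the paper's exact configuration.

```python
# Hypothetical CNN + bidirectional RNN sketch for synthetic-speech detection.
# Assumes log-mel spectrogram input of shape (batch, 1, n_mels, frames).
# All layer sizes are illustrative, not taken from the paper.
import torch
import torch.nn as nn

class SynthSpeechDetector(nn.Module):
    def __init__(self, n_mels=80, n_sources=4):
        super().__init__()
        # CNN front end: extracts local spectro-temporal features.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),  # halves both frequency and time axes
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        feat_dim = 32 * (n_mels // 4)  # channels x pooled frequency bins
        # Bidirectional GRU models temporal dependencies in both directions.
        self.birnn = nn.GRU(feat_dim, 64, batch_first=True, bidirectional=True)
        # Two heads: real-vs-synthetic, and source-architecture identification.
        self.real_fake = nn.Linear(128, 2)
        self.source = nn.Linear(128, n_sources)

    def forward(self, x):
        h = self.cnn(x)                 # (batch, 32, n_mels//4, frames//4)
        b, c, f, t = h.shape
        # Flatten channel x frequency into one feature vector per time step.
        h = h.permute(0, 3, 1, 2).reshape(b, t, c * f)
        _, hn = self.birnn(h)           # final states: (2, batch, 64)
        h = torch.cat([hn[0], hn[1]], dim=1)  # (batch, 128)
        return self.real_fake(h), self.source(h)

model = SynthSpeechDetector()
x = torch.randn(2, 1, 80, 100)          # batch of 2 spectrograms, 100 frames
rf, src = model(x)                      # rf: (2, 2), src: (2, 4)
```

In this sketch the two linear heads share one recurrent encoding, so both objectives (detection and source attribution) are trained from the same learned representation; whether the paper shares the encoder this way is an assumption.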

Updated: 2021-07-27