Using Deep Learning Techniques and Inferential Speech Statistics for AI Synthesised Speech Recognition
arXiv - CS - Multimedia. Pub Date: 2021-07-23. DOI: arxiv-2107.11412. Arun Kumar Singh (Indian Institute of Technology Jammu), Priyanka Singh (Dhirubhai Ambani Institute of Information and Communication Technology), Karan Nathwani (Indian Institute of Technology Jammu)
Recent developments in technology have rewarded us with impressive audio synthesis models such as Tacotron and WaveNet. On the other hand, these models pose serious threats, such as voice clones and deepfakes, that may go undetected. To tackle these alarming situations, there is an urgent need for models that can discriminate synthesized speech from actual human speech and also identify the source of such a synthesis. Here, we propose a model based on a Convolutional Neural Network (CNN) and a Bidirectional Recurrent Neural Network (BiRNN) that achieves both of the aforementioned objectives. The temporal dependencies present in AI-synthesized speech are exploited using the Bidirectional RNN and the CNN. The model outperforms state-of-the-art approaches, classifying AI-synthesized audio versus real human speech with an error rate of 1.9% and detecting the underlying synthesis architecture with an accuracy of 97%.
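The abstract describes a two-headed pipeline: convolutional features over the speech signal, a bidirectional recurrent pass to capture temporal dependencies, and two classification heads (real vs. synthetic, and source-architecture attribution). The paper does not give the exact layer sizes or feature front end, so the following NumPy sketch is purely illustrative: feature dimensions, kernel width, hidden size, the number of candidate source architectures, and all weights are hypothetical placeholders, not the authors' configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, w, b):
    # x: (T, C_in) frame-level features; w: (K, C_in, C_out).
    # Valid 1-D convolution over time, followed by ReLU.
    K, _, C_out = w.shape
    T_out = x.shape[0] - K + 1
    out = np.empty((T_out, C_out))
    for t in range(T_out):
        out[t] = np.tensordot(x[t:t + K], w, axes=([0, 1], [0, 1])) + b
    return np.maximum(out, 0.0)

def rnn_pass(x, wx, wh, b):
    # Simple tanh RNN scanned over time; returns the final hidden state.
    h = np.zeros(wh.shape[0])
    for xt in x:
        h = np.tanh(xt @ wx + h @ wh + b)
    return h

def birnn(x, params):
    # Bidirectional pass: run the sequence forwards and backwards,
    # then concatenate the two summaries (shared weights for brevity).
    wx, wh, b = params
    fwd = rnn_pass(x, wx, wh, b)
    bwd = rnn_pass(x[::-1], wx, wh, b)
    return np.concatenate([fwd, bwd])

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical sizes: 40-dim spectral features, 200 frames, and 4
# candidate synthesis architectures for source attribution.
T, C_in, C_out, H, n_sources = 200, 40, 16, 32, 4
x = rng.standard_normal((T, C_in))          # stand-in for real features

w_conv = rng.standard_normal((5, C_in, C_out)) * 0.1
b_conv = np.zeros(C_out)
rnn_p = (rng.standard_normal((C_out, H)) * 0.1,
         rng.standard_normal((H, H)) * 0.1,
         np.zeros(H))
w_bin = rng.standard_normal((2 * H, 2)) * 0.1          # real vs. synthetic
w_src = rng.standard_normal((2 * H, n_sources)) * 0.1  # source architecture

feats = conv1d(x, w_conv, b_conv)   # local patterns via CNN
summary = birnn(feats, rnn_p)       # temporal context in both directions
p_fake = softmax(summary @ w_bin)   # P(real), P(synthetic)
p_src = softmax(summary @ w_src)    # distribution over candidate sources
```

With trained weights and a real feature front end, `p_fake` would drive the detection decision and `p_src` the attribution decision; here the untrained outputs only demonstrate the shapes and data flow.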
Updated: 2021-07-27