Recurrent Convolutional Structures for Audio Spoof and Video Deepfake Detection
IEEE Journal of Selected Topics in Signal Processing (IF 7.5), Pub Date: 2020-08-01, DOI: 10.1109/jstsp.2020.2999185
Akash Chintha, Bao Thai, Saniat Javid Sohrawardi, Kartavya Bhatt, Andrea Hickerson, Matthew Wright, Raymond Ptucha

Deepfakes, or artificially generated audiovisual renderings, can be used to defame a public figure or influence public opinion. With the recent discovery of generative adversarial networks, an attacker using a normal desktop computer fitted with an off-the-shelf graphics processing unit can make renditions realistic enough to easily fool a human observer. Detecting deepfakes is thus becoming important for reporters, social media platforms, and the general public. In this work, we introduce simple, yet surprisingly efficient digital forensic methods for audio spoof and visual deepfake detection. Our methods combine convolutional latent representations with bidirectional recurrent structures and entropy-based cost functions. The latent representations for both audio and video are carefully chosen to extract semantically rich information from the recordings. By feeding these into a recurrent framework, we can detect both spatial and temporal signatures of deepfake renditions. The entropy-based cost functions work well in isolation as well as in context with traditional cost functions. We demonstrate our methods on the FaceForensics++ and Celeb-DF video datasets and the ASVSpoof 2019 Logical Access audio datasets, achieving new benchmarks in all categories. We also perform extensive studies to demonstrate generalization to new domains and gain further insight into the effectiveness of the new architectures.
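To make the described pipeline concrete, below is a minimal, hedged sketch in PyTorch of the general pattern the abstract outlines: a convolutional encoder produces a latent vector per frame (or spectrogram slice), a bidirectional recurrent layer models temporal structure across the sequence, and the training loss combines standard cross-entropy with an entropy-based term. This is not the authors' implementation; names such as FrameEncoder, RecurrentConvDetector, entropy_regularized_loss, and entropy_weight are illustrative assumptions, and the exact entropy-based cost in the paper may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameEncoder(nn.Module):
    """Small CNN that maps one frame (or spectrogram slice) to a latent vector."""
    def __init__(self, in_channels=3, latent_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, latent_dim)

    def forward(self, x):                        # x: (B, C, H, W)
        h = self.features(x).flatten(1)          # (B, 64)
        return self.proj(h)                      # (B, latent_dim)

class RecurrentConvDetector(nn.Module):
    """Per-frame CNN latents -> bidirectional LSTM -> real/fake logits."""
    def __init__(self, latent_dim=128, hidden_dim=64, num_classes=2):
        super().__init__()
        self.encoder = FrameEncoder(latent_dim=latent_dim)
        self.rnn = nn.LSTM(latent_dim, hidden_dim,
                           batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, clips):                    # clips: (B, T, C, H, W)
        b, t = clips.shape[:2]
        latents = self.encoder(clips.flatten(0, 1)).view(b, t, -1)
        seq, _ = self.rnn(latents)               # (B, T, 2*hidden_dim)
        return self.classifier(seq.mean(dim=1))  # pool over time -> (B, num_classes)

def entropy_regularized_loss(logits, labels, entropy_weight=0.1):
    """Cross-entropy plus a penalty on predictive entropy (one plausible reading
    of an 'entropy-based' cost; assumed here, not taken from the paper)."""
    ce = F.cross_entropy(logits, labels)
    probs = F.softmax(logits, dim=1)
    entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=1).mean()
    return ce + entropy_weight * entropy

# Toy usage with random data: 4 clips of 8 frames each
model = RecurrentConvDetector()
clips = torch.randn(4, 8, 3, 64, 64)
labels = torch.randint(0, 2, (4,))
loss = entropy_regularized_loss(model(clips), labels)
loss.backward()
```

The same pattern applies to audio by replacing the frame encoder's input with spectrogram slices; the bidirectional recurrence is what lets the detector pick up temporal inconsistencies in addition to per-frame spatial artifacts.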

Updated: 2020-08-01