ISTVT: Interpretable Spatial-Temporal Video Transformer for Deepfake Detection, IEEE Transactions on Information Forensics and Security


ISTVT: Interpretable Spatial-Temporal Video Transformer for Deepfake Detection
IEEE Transactions on Information Forensics and Security (IF 6.3), Pub Date: 2023-01-23, DOI: 10.1109/tifs.2023.3239223
Cairong Zhao, Chutian Wang, Guosheng Hu, Haonan Chen, Chun Liu, Jinhui Tang


With the rapid development of Deepfake synthesis technology, information security and personal privacy have been severely threatened in recent years. To achieve robust Deepfake detection, researchers have tried to exploit the joint spatial-temporal information in videos, for example with recurrent networks and 3D convolutional networks. However, these spatial-temporal models still leave room for improvement. Another general challenge is that it is not clearly understood what these spatial-temporal models actually learn. To address these two challenges, we propose an Interpretable Spatial-Temporal Video Transformer (ISTVT), which consists of a novel decomposed spatial-temporal self-attention and a self-subtract mechanism to capture spatial artifacts and temporal inconsistency for robust Deepfake detection. Thanks to this decomposition, we interpret ISTVT by visualizing the discriminative regions in both the spatial and temporal dimensions via a relevance propagation algorithm, where relevance is the pixel-wise importance on the input. We conduct extensive experiments on large-scale datasets, including FaceForensics++, FaceShifter, DeeperForensics, Celeb-DF, and DFDC. The strong performance on intra-dataset and cross-dataset Deepfake detection demonstrates the effectiveness and robustness of our method, and the visualization-based interpretability offers insights into the model.
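The abstract names two components, a decomposed spatial-temporal self-attention and a self-subtract mechanism, without giving their details. The following is a minimal illustrative sketch of one plausible reading, not the authors' implementation. Everything in it is an assumption: the function names (`attend`, `self_subtract`, `decomposed_attention`), the spatial-then-temporal order, the use of a single head with identity projections (no learned weights), and the choice to use residuals between consecutive frames as the queries and keys of the temporal step.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(values, keys=None):
    """Single-head dot-product attention with identity projections.

    `values` are the vectors that get mixed; `keys` (default: `values`)
    serve as both queries and keys. Assumption: no learned weights.
    """
    keys = values if keys is None else keys
    dim = len(values[0])
    scale = math.sqrt(len(keys[0]))
    mixed = []
    for query in keys:
        scores = [sum(a * b for a, b in zip(query, key)) / scale for key in keys]
        weights = softmax(scores)
        mixed.append([sum(w * v[d] for w, v in zip(weights, values))
                      for d in range(dim)])
    return mixed


def self_subtract(track):
    """Residual between consecutive frames; the first frame gets zeros."""
    zeros = [0.0] * len(track[0])
    diffs = [[c - p for c, p in zip(cur, prev)]
             for prev, cur in zip(track, track[1:])]
    return [zeros] + diffs


def decomposed_attention(video):
    """video[t][n] is a feature vector for patch n of frame t."""
    num_frames, num_patches = len(video), len(video[0])
    # Spatial step: attend over the patches inside each frame.
    spatial = [attend(frame) for frame in video]
    # Temporal step: for each patch position, attend over the frames,
    # scoring frames by their residuals (hypothetical self-subtract use).
    out = [[None] * num_patches for _ in range(num_frames)]
    for n in range(num_patches):
        track = [spatial[t][n] for t in range(num_frames)]
        mixed = attend(track, keys=self_subtract(track))
        for t in range(num_frames):
            out[t][n] = mixed[t]
    return out


# Toy input: 3 frames, 2 patches per frame, 2-dimensional features.
toy_video = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.1], [0.0, 1.0]],
    [[0.2, 0.9], [0.0, 1.0]],
]
toy_output = decomposed_attention(toy_video)
```

In this sketch, splitting attention into a per-frame spatial step and a per-patch temporal step scores T·N² + N·T² token pairs instead of the (T·N)² pairs of joint attention over all T frames and N patches. On a perfectly static clip the residuals are all zero, so the temporal weights become uniform and the temporal step leaves the spatial features unchanged.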


Updated: 2024-08-22
