Shallow and deep feature fusion for digital audio tampering detection, EURASIP Journal on Advances in Signal Processing



Shallow and deep feature fusion for digital audio tampering detection
EURASIP Journal on Advances in Signal Processing (IF 1.9), Pub Date: 2022-08-13, DOI: 10.1186/s13634-022-00900-4
Zhifeng Wang, Yao Yang, Chunyan Zeng, Shuai Kong, Shixiong Feng, Nan Zhao

Digital audio tampering detection is used to verify the authenticity of digital audio. Most current methods either compare the audio's electric network frequency (ENF) against a standard ENF database and visually inspect its continuity, or extract features and classify them with machine learning. ENF databases are usually hard to obtain, visual methods have weak feature representation, and machine learning methods lose information during feature extraction, so detection accuracy is low. This paper proposes fusing shallow and deep features to make full use of ENF information: features at different levels are complementary, and together they describe more accurately the inconsistencies that tampering introduces into raw digital audio. First, the audio signal is band-pass filtered to obtain the ENF component, and the discrete Fourier transform (DFT) and Hilbert transform are applied to obtain its phase and instantaneous frequency. Next, three inputs are derived: the mean variation of the sequence serves as the shallow feature; the ENF sequence is framed and reshaped into a feature matrix that is fed to a convolutional neural network (CNN); and curve fitting yields fitted-coefficient features. The CNN extracts local ENF details from the feature matrix, and a deep neural network (DNN) extracts global ENF information from the fitted-coefficient features; together, the local and global information form the deep features. The shallow and deep features are then fused with an attention mechanism, which gives greater weight to features useful for classification and suppresses uninformative ones. Finally, a DNN with two fully connected layers performs dimensionality reduction and fitting, and a Softmax layer performs the classification.
The method achieves 97.03% accuracy on three classic databases (Carioca 1, Carioca 2, and New Spanish) and 88.31% accuracy on the newly constructed GAUDI-DI database. Experimental results show that the proposed method outperforms state-of-the-art methods.
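
The signal-processing front end described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the 50 Hz nominal frequency, the pass-band width, the filter order, the frame length and hop, and the polynomial degree are all assumed values that the abstract does not specify, and the function names are hypothetical. The phase is taken here from the Hilbert analytic signal for brevity, whereas the paper also uses the DFT. The shallow feature is assumed to be the mean absolute sample-to-sample change.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert


def extract_enf(audio, fs, nominal_hz=50.0, half_band_hz=0.6, order=2):
    """Band-pass the audio around the (assumed) nominal mains frequency and
    estimate instantaneous phase and frequency from the analytic signal."""
    sos = butter(
        order,
        [nominal_hz - half_band_hz, nominal_hz + half_band_hz],
        btype="bandpass",
        fs=fs,
        output="sos",
    )
    enf = sosfiltfilt(sos, audio)            # zero-phase narrow-band filtering
    analytic = hilbert(enf)                  # analytic signal via Hilbert transform
    phase = np.unwrap(np.angle(analytic))    # instantaneous phase in radians
    inst_freq = np.diff(phase) * fs / (2.0 * np.pi)  # instantaneous frequency in Hz
    return enf, phase, inst_freq


def shallow_feature(seq):
    """Mean absolute sample-to-sample variation (assumed definition)."""
    return float(np.mean(np.abs(np.diff(seq))))


def frame_matrix(seq, frame_len, hop):
    """Cut a 1-D sequence into overlapping frames, one frame per row,
    giving a 2-D matrix of the kind a CNN could take as input."""
    seq = np.asarray(seq)
    n_frames = 1 + (len(seq) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return seq[idx]


def fit_coefficients(seq, degree=5):
    """Polynomial curve fit of the sequence; the coefficients (lowest order
    first) stand in for the fitted-coefficient features fed to the DNN."""
    t = np.linspace(-1.0, 1.0, len(seq))
    return np.polynomial.polynomial.polyfit(t, seq, degree)
```

Under these assumptions, a splice that removes a fraction of a mains cycle tends to show up as a disturbance in `inst_freq`, which raises `shallow_feature`, while `frame_matrix` and `fit_coefficients` produce the local and global inputs for the two networks. The attention-based fusion and the classifier are not sketched here.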


Updated: 2022-08-13
