SAMAF: Sequence-to-Sequence Autoencoder Model for Audio Fingerprinting
ACM Transactions on Multimedia Computing, Communications, and Applications (IF 5.1). Pub Date: 2020-05-25. DOI: 10.1145/3380828
Abraham Báez-Suárez, Nolan Shah, Juan Arturo Nolazco-Flores, Shou-Hsuan S. Huang, Omprakash Gnawali, Weidong Shi

Audio fingerprinting techniques were developed to index and retrieve audio samples by comparing a content-based compact signature of the audio instead of the entire audio sample, thereby reducing memory and computational expense. Different techniques have been applied to create audio fingerprints; however, with the introduction of deep learning, new data-driven unsupervised approaches are available. This article presents the Sequence-to-Sequence Autoencoder Model for Audio Fingerprinting (SAMAF), which improves hash generation through a novel loss function composed of three terms: Mean Square Error, minimizing the reconstruction error; Hash Loss, minimizing the distance between similar hashes and encouraging clustering; and Bitwise Entropy Loss, minimizing the variation inside the clusters. The performance of the model was assessed on a subset of the VoxCeleb1 dataset, a “speech in-the-wild” dataset. Furthermore, the model was compared against three baselines: Dejavu, a Shazam-like algorithm; Robust Audio Fingerprinting System (RAFS), a Bit Error Rate (BER) methodology robust to time-frequency distortions and coding/decoding transformations; and Panako, a constellation-based algorithm adding time-frequency distortion resilience. Extensive empirical evidence showed that our approach outperformed all the baselines on the audio identification task and on other classification tasks related to attributes of the audio signal, with an economical hash size of either 128 or 256 bits for one second of audio.
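The abstract names the three loss terms but not how they are weighted or measured. As a minimal sketch of how such a composite loss could be written, assuming Euclidean distances, a pair set S of segments drawn from the same audio, and tunable weights λ_h and λ_e (all assumptions for illustration, not taken from the paper):

\[
\mathcal{L} \;=\; \underbrace{\frac{1}{N}\sum_{i=1}^{N} \lVert x_i - \hat{x}_i \rVert_2^2}_{\text{Mean Square Error}}
\;+\; \lambda_h \underbrace{\sum_{(i,j)\in\mathcal{S}} \lVert h_i - h_j \rVert_2^2}_{\text{Hash Loss}}
\;+\; \lambda_e \underbrace{\sum_{k=1}^{B} H\!\bigl(\bar{h}_k\bigr)}_{\text{Bitwise Entropy Loss}}
\]

Here x_i is an input audio segment and \hat{x}_i its reconstruction by the autoencoder, h_i is the B-bit hash of segment i (B = 128 or 256 per second of audio, per the abstract), \bar{h}_k is the mean value of bit k over a cluster of similar segments, and H is the binary entropy. Driving H(\bar{h}_k) toward zero pushes each bit to take the same value across a cluster, which is one way to realize "minimizing the variation inside the clusters."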

Updated: 2020-05-25