Audio Transformers: Transformer Architectures For Large Scale Audio Understanding. Adieu Convolutions
arXiv - CS - Multimedia. Pub Date: 2021-05-01, DOI: arxiv-2105.00335
Prateek Verma, Jonathan Berger

Over the past two decades, CNN architectures have produced compelling models of sound perception and cognition, learning hierarchical organizations of features. Analogous to successes in computer vision, audio feature classification can be optimized for a particular task of interest, over a wide variety of datasets and labels. In fact, similar architectures designed for image understanding have proven effective for acoustic scene analysis. Here we propose applying Transformer-based architectures without convolutional layers to raw audio signals. On Free Sound 50K, a standard dataset comprising 200 categories, our model outperforms convolutional models to produce state-of-the-art results. This is significant because, unlike in natural language processing and computer vision, we do not perform unsupervised pre-training to outperform convolutional architectures. On the same training set, we show a significant improvement with respect to mean average precision benchmarks. We further improve the performance of Transformer architectures by using techniques such as pooling, inspired by convolutional networks designed in the past few years. In addition, we show how multi-rate signal processing ideas inspired by wavelets can be applied to the Transformer embeddings to improve the results. We also show how our model learns a non-linear, non-constant bandwidth filter-bank, an adaptable time-frequency front-end representation for the task of audio understanding, different from that learned for other tasks, e.g. pitch estimation.
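The convolution-free idea above can be sketched minimally: slice the raw waveform into fixed-size patches, project each patch to an embedding (this learned projection is what can behave like an adaptive filter-bank), and run self-attention over the resulting token sequence. The sizes, initialization, and single-layer setup below are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 400-sample patches (25 ms at 16 kHz), 64-dim embeddings.
patch_len, d_model = 400, 64
audio = rng.standard_normal(8000)  # 0.5 s of raw audio, stand-in for a real clip

# Tokenize raw audio: non-overlapping patches, then a learned linear projection.
usable = len(audio) // patch_len * patch_len
patches = audio[:usable].reshape(-1, patch_len)        # (20, 400)
W_embed = rng.standard_normal((patch_len, d_model)) * 0.01
tokens = patches @ W_embed                             # (20, 64)

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over the token sequence."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(x.shape[-1])
    # Numerically stable softmax over the key axis.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

Wq = rng.standard_normal((d_model, d_model)) * 0.01
Wk = rng.standard_normal((d_model, d_model)) * 0.01
Wv = rng.standard_normal((d_model, d_model)) * 0.01

out = self_attention(tokens, Wq, Wk, Wv)
print(out.shape)  # (20, 64): one contextualized embedding per audio patch
```

A full model would stack several such layers with feed-forward blocks and, as the abstract notes, could interleave pooling between layers to shorten the token sequence, much as strided convolutions shrink feature maps.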

Updated: 2021-05-01