Transformer-based ensemble method for multiple predominant instruments recognition in polyphonic music
EURASIP Journal on Audio, Speech, and Music Processing ( IF 1.7 ) Pub Date : 2022-05-16 , DOI: 10.1186/s13636-022-00245-8
Lekshmi Chandrika Reghunath , Rajeev Rajan

Multiple predominant instrument recognition in polyphonic music is addressed using decision-level fusion of three transformer-based architectures over an ensemble of visual representations: Mel-spectrogram, modgdgram, and tempogram. Predominant instrument recognition is the problem of identifying the prominent instrument from a mixture of instruments playing together. We experimented with two transformer architectures for the proposed task, namely the Vision transformer (Vi-T) and the Shifted window transformer (Swin-T). The performance of the proposed system is compared with that of the state-of-the-art Han's model, convolutional neural networks (CNN), and deep neural networks (DNN). The transformer networks learn distinctive local characteristics from the visual representations and classify each instrument into the class to which it belongs. The proposed system is systematically evaluated on the IRMAS dataset with eleven classes. A wave generative adversarial network (WaveGAN) architecture is also employed to generate audio files for data augmentation. We train our networks on fixed-length music excerpts with a single-labeled predominant instrument and estimate an arbitrary number of predominant instruments from variable-length test audio files, without the sliding-window analysis and aggregation strategies used in existing algorithms. The ensemble voting scheme using Swin-T reports micro and macro F1 scores of 0.66 and 0.62, respectively; these are, in relative terms, 3.12% and 12.72% higher than those obtained by the state-of-the-art Han's model. The architectural choice of transformers with ensemble voting on Mel-spectro-/modgd-/tempogram has merit in recognizing the predominant instruments in polyphonic music.
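The decision-level fusion described above can be sketched as majority voting over the per-representation models' multi-label predictions, scored with micro/macro F1. This is an illustrative sketch only, not the authors' code: the 0.5 decision threshold, the random probabilities, and the ground-truth labels are all assumptions for demonstration.

```python
import numpy as np
from sklearn.metrics import f1_score

N_CLASSES = 11  # IRMAS predominant-instrument classes
N_CLIPS = 4     # illustrative number of test clips

rng = np.random.default_rng(0)
# Hypothetical sigmoid outputs of the three models, one per visual
# representation (Mel-spectrogram, modgdgram, tempogram): shape (model, clip, class).
probs = rng.random((3, N_CLIPS, N_CLASSES))

# Each model votes for every class whose probability exceeds a threshold
# (0.5 is an assumption, not taken from the paper).
votes = (probs > 0.5).astype(int)

# Decision-level fusion: a class is predicted when at least 2 of 3 models agree,
# which allows an arbitrary number of predominant instruments per clip.
fused = (votes.sum(axis=0) >= 2).astype(int)

# Illustrative multi-label ground truth.
y_true = (rng.random((N_CLIPS, N_CLASSES)) > 0.7).astype(int)

# Micro F1 pools all class decisions; macro F1 averages per-class F1 scores.
micro = f1_score(y_true, fused, average="micro", zero_division=0)
macro = f1_score(y_true, fused, average="macro", zero_division=0)
print(f"micro-F1={micro:.2f}, macro-F1={macro:.2f}")
```

With real model outputs in place of the random probabilities, the same two calls to `f1_score` reproduce the kind of micro/macro evaluation reported in the abstract.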

Updated: 2022-05-16