Time–frequency scattering accurately models auditory similarities between instrumental playing techniques
EURASIP Journal on Audio, Speech, and Music Processing (IF 1.7). Pub Date: 2021-01-11. DOI: 10.1186/s13636-020-00187-z
Vincent Lostanlen, Christian El-Hajj, Mathias Rossignol, Grégoire Lafay, Joakim Andén, Mathieu Lagrange

Instrumental playing techniques such as vibratos, glissandos, and trills often denote musical expressivity, both in classical and folk contexts. However, most existing approaches to music similarity retrieval fail to describe timbre beyond the so-called “ordinary” technique, use instrument identity as a proxy for timbre quality, and do not allow for customization to the perceptual idiosyncrasies of a new subject. In this article, we ask 31 human participants to organize 78 isolated notes into a set of timbre clusters. Analyzing their responses suggests that timbre perception operates within a more flexible taxonomy than those provided by instruments or playing techniques alone. In addition, we propose a machine listening model to recover the cluster graph of auditory similarities across instruments, mutes, and techniques. Our model relies on joint time–frequency scattering features to extract spectrotemporal modulations as acoustic features. Furthermore, it minimizes a triplet loss in the cluster graph by means of the large-margin nearest neighbor (LMNN) metric learning algorithm. Over a dataset of 9346 isolated notes, we report a state-of-the-art average precision at rank five (AP@5) of 99.0% ± 1%. An ablation study demonstrates that removing either the joint time–frequency scattering transform or the metric learning algorithm noticeably degrades performance.
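As a rough illustration of the retrieval pipeline described in the abstract (scattering features, LMNN metric learning, then nearest-neighbor evaluation), here is a minimal Python sketch. It is not the authors' implementation: it substitutes Kymatio's Scattering1D for the joint time–frequency scattering transform of the paper, uses the LMNN implementation from the metric-learn package, and runs on randomly generated stand-in data; all function names and parameter values below are illustrative assumptions only.

import numpy as np
from kymatio.numpy import Scattering1D           # pip install kymatio
from metric_learn import LMNN                    # pip install metric-learn
from sklearn.neighbors import NearestNeighbors

def scattering_features(waveforms, J=6, Q=8):
    # Time-average log-compressed scattering coefficients: one vector per note.
    # (The paper uses joint time-frequency scattering; plain time scattering
    # is a simpler stand-in here.)
    scattering = Scattering1D(J=J, shape=waveforms.shape[-1], Q=Q)
    S = scattering(waveforms)                    # (n_notes, n_paths, n_frames)
    return np.log1p(S).mean(axis=-1)             # (n_notes, n_paths)

def precision_at_5(X, labels):
    # Mean fraction of each query's 5 nearest neighbors sharing its cluster.
    nn = NearestNeighbors(n_neighbors=6).fit(X)  # 6 = the query itself + 5
    _, idx = nn.kneighbors(X)
    return (labels[idx[:, 1:]] == labels[:, None]).mean()

# Stand-in data: 32 random "notes", 8 per arbitrary cluster label.
rng = np.random.default_rng(0)
waveforms = rng.standard_normal((32, 2**13))
cluster_ids = np.repeat(np.arange(4), 8)

X = scattering_features(waveforms)
lmnn = LMNN(n_neighbors=5)                       # `k=5` in older metric-learn releases
X_metric = lmnn.fit_transform(X, cluster_ids)
print("precision@5:", precision_at_5(X_metric, cluster_ids))

On the real task, waveforms would hold the 9346 isolated notes and cluster_ids the clusters elicited from the human participants; the metric learned by LMNN is what improves retrieval precision relative to raw scattering distances.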
