Paralinguistic singing attribute recognition using supervised machine learning for describing the classical tenor solo singing voice in vocal pedagogy
EURASIP Journal on Audio, Speech, and Music Processing (IF 1.7). Pub Date: 2022-04-15. DOI: 10.1186/s13636-022-00240-z
Yanze Xu 1, Weiqing Wang 1, Huahua Cui 2, Mingyang Xu 2, Ming Li 1

Humans can recognize a person's identity from their voice and describe its timbral phenomena; the singing voice likewise exhibits timbral phenomena. In vocal pedagogy, vocal teachers listen to and then describe the timbral phenomena of their students' singing voices. In this study, to enable machines to describe the singing voice from the vocal-pedagogy point of view, we perform a task called paralinguistic singing attribute recognition. To achieve this goal, we first construct and publish an open-source dataset named the Singing Voice Quality and Technique Database (SVQTD) for supervised learning. All audio clips in SVQTD are downloaded from YouTube and processed by music source separation and silence detection. For annotation, seven paralinguistic singing attributes commonly used in vocal pedagogy are adopted as labels. Furthermore, to explore different supervised machine learning algorithms for classifying each paralinguistic singing attribute, we adopt three main frameworks: openSMILE features with a support vector machine (SF-SVM), end-to-end deep learning (E2EDL), and deep embeddings with a support vector machine (DE-SVM). Our methods build on frameworks commonly employed in other paralinguistic speech attribute recognition tasks. In SF-SVM, we separately use the feature sets of the INTERSPEECH 2009 Challenge and the INTERSPEECH 2016 Challenge as the SVM classifier's input. In E2EDL, the end-to-end framework separately utilizes a ResNet and a transformer encoder as the feature extractor. In particular, to handle the two-dimensional spectrogram input to the transformer, we adopt a sliced multi-head self-attention (SMSA) mechanism. In DE-SVM, we use the representation extracted from the E2EDL model as the input to the SVM classifier.
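The DE-SVM idea — training a deep network end to end, then reusing its intermediate representation as the input to a conventional classifier — can be sketched as follows. This is an illustrative outline only, not the authors' implementation: the embedding extractor here is a stand-in (a fixed random projection with mean pooling), and the data, dimensions, and class structure are synthetic.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Stand-in for a trained deep model's penultimate-layer output.
# In the paper's DE-SVM setting this would come from the ResNet or
# transformer trained end to end on SVQTD; here it is a random projection.
W = rng.normal(size=(128, 64))

def extract_embedding(spectrogram: np.ndarray) -> np.ndarray:
    """Map a (time, 128)-bin spectrogram to a 64-d clip embedding by
    mean-pooling frame-level projections (illustrative only)."""
    return np.tanh(spectrogram @ W).mean(axis=0)

# Synthetic "clips": two classes with slightly shifted spectral statistics,
# mimicking a binary singing-attribute label.
X = np.stack([extract_embedding(rng.normal(loc=label * 0.5, size=(100, 128)))
              for label in (0, 1) for _ in range(50)])
y = np.repeat([0, 1], 50)

# Back-end SVM classifier operating on the deep embeddings.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X, y)
print(clf.score(X, y))
```

The design point being illustrated: the deep model is used purely as a feature extractor, so the SVM can be retrained per attribute without retraining the network.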
Experimental results on SVQTD show no absolute winner between E2EDL and DE-SVM: a back-end SVM classifier fed the representation learned end to end does not necessarily improve performance. However, the DE-SVM variant that uses the ResNet as the feature extractor achieves the best average unweighted average recall (UAR), an average 16% improvement over the SF-SVM with INTERSPEECH's hand-crafted feature sets.
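UAR (unweighted average recall), the metric reported above, averages the per-class recalls so that minority classes count equally — equivalent to macro-averaged recall or balanced accuracy in scikit-learn. A small sketch with made-up predictions:

```python
from sklearn.metrics import balanced_accuracy_score, recall_score

y_true = [0, 0, 0, 0, 1, 1]   # imbalanced: 4 vs 2
y_pred = [0, 0, 0, 0, 1, 0]   # one minority-class miss

# UAR = mean of per-class recalls = (4/4 + 1/2) / 2 = 0.75
uar = recall_score(y_true, y_pred, average="macro")
print(uar)  # 0.75
```

Plain accuracy here would be 5/6 ≈ 0.83, so UAR penalizes the minority-class error more, which is why it is preferred for imbalanced attribute labels.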
