AVID: A speech database for machine learning studies on vocal intensity
Speech Communication (IF 3.2), Pub Date: 2024-01-23, DOI: 10.1016/j.specom.2024.103039
Paavo Alku, Manila Kodali, Laura Laaksonen, Sudarsana Reddy Kadiri

Vocal intensity, which is typically quantified with the sound pressure level (SPL), is a key feature of speech. To measure SPL from speech recordings, a standard calibration tone (with a reference SPL of 94 dB or 114 dB) needs to be recorded together with the speech. However, most of the popular databases used in areas such as speech and speaker recognition have been recorded without calibration information, with speech represented on arbitrary amplitude scales. Therefore, information about the vocal intensity of the recorded speech, including SPL, is lost. In the current study, we introduce a new open and calibrated speech/electroglottography (EGG) database named the Aalto Vocal Intensity Database (AVID). AVID includes speech and EGG produced by 50 speakers (25 males, 25 females) who varied their vocal intensity across four categories (soft, normal, loud and very loud). Recordings were conducted using a constant mouth-to-microphone distance and by recording a calibration tone. The speech data was labelled sentence-wise using a total of 19 labels that support the use of the data in supervised machine learning (ML) studies of vocal intensity. To demonstrate how the AVID data can be used to study vocal intensity, we investigated one multi-class classification task (classification of speech into soft, normal, loud and very loud intensity classes) and one regression task (prediction of the SPL of speech). In both tasks, we deliberately warped the level of the input speech by normalising the signal so that its maximum amplitude equals 1.0, that is, we simulated a scenario that is prevalent in current speech databases. The results show that using the spectrogram feature with the support vector machine classifier gave an accuracy of 82% in the multi-class classification of the vocal intensity category.
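The calibration scheme described above can be illustrated with a short sketch. Assuming the calibration tone was recorded with the same equipment and gain settings as the speech, the ratio of RMS levels maps an arbitrary amplitude scale onto absolute dB SPL; the function names and the 16 kHz/94 dB parameters below are illustrative, not taken from the paper:

```python
import numpy as np

def spl_from_calibration(speech, cal_tone, cal_spl_db=94.0):
    """Estimate the SPL of a speech signal from a calibration tone
    whose reference SPL (here 94 dB) is known.

    Both signals are on the same arbitrary amplitude scale, so the
    ratio of their RMS levels converts speech level into dB SPL.
    """
    rms_speech = np.sqrt(np.mean(np.square(speech)))
    rms_cal = np.sqrt(np.mean(np.square(cal_tone)))
    return cal_spl_db + 20.0 * np.log10(rms_speech / rms_cal)

def peak_normalise(x):
    """Scale a waveform so its maximum absolute amplitude is 1.0.

    This is the level-warping step used in the paper's experiments:
    it discards the absolute level information that the calibration
    tone would otherwise preserve.
    """
    return x / np.max(np.abs(x))
```

For example, a speech segment recorded at one tenth of the calibration tone's amplitude would come out 20 dB below the 94 dB reference, i.e. at 74 dB SPL.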
In the prediction of SPL, using the spectrogram feature with the support vector regressor gave a mean absolute error of about 2 dB and a coefficient of determination of 92%. We welcome researchers interested in classification and regression problems to utilise AVID in the study of vocal intensity, and we hope that the current results could serve as baselines for future ML studies on the topic.
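A minimal sketch of a spectrogram-plus-SVM pipeline of the kind described in the abstract follows. This is not the authors' exact feature extraction or model configuration; the window sizes, the time-averaged log-spectrogram feature, and the RBF kernels are assumptions chosen to keep the example short:

```python
import numpy as np
from scipy.signal import spectrogram
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, SVR

def spectrogram_features(x, fs=16000):
    """Average the log-magnitude spectrogram over time to get a
    fixed-length feature vector per utterance (a simplification of
    the paper's spectrogram feature)."""
    _, _, S = spectrogram(x, fs=fs, nperseg=512, noverlap=256)
    return np.log(S + 1e-10).mean(axis=1)

# Multi-class intensity classification (soft/normal/loud/very loud)
# and SPL regression, each as a scaler + support vector model.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
reg = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
```

With real AVID data, `clf` would be fit on per-sentence feature vectors with intensity-class labels and `reg` on the same features with calibrated SPL targets; evaluation would use accuracy for the classifier and mean absolute error plus the coefficient of determination for the regressor, as reported in the paper.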




Updated: 2024-01-25