Fundamental frequency feature warping for frequency normalization and data augmentation in child automatic speech recognition
Speech Communication (IF 2.4), Pub Date: 2021-09-14, DOI: 10.1016/j.specom.2021.08.002
Gary Yeung, Ruchao Fan, Abeer Alwan

Effective child automatic speech recognition (ASR) systems have become increasingly important due to the growing use of interactive technology. Because publicly available child speech databases are scarce, young-child ASR systems often rely on older-child or adult speech for training data. However, there is a large acoustic mismatch between child and adult speech. This study proposes a novel fundamental frequency (fₒ)-based frequency warping technique for both frequency normalization and data augmentation, to combat this acoustic mismatch and address the lack of available child speech training data. The technique is inspired by the tonotopic distances between formants and fₒ, originally developed to model human vowel perception. The tonotopic distances are reformulated as a linear relationship between fₒ and vowel formants on the Mel scale. This reformulation is verified using fₒ and formant measurements from child utterances. The relationship is further generalized such that the frequency warping technique relies on only two parameters. The LibriSpeech ASR corpus is used for training, and both the OGI Kids' Speech and CMU Kids corpora are used for both training and testing. A single-word ASR experiment and a continuous read-speech ASR experiment are performed to evaluate the fₒ-based frequency normalization and data augmentation techniques. In the single-word experiment, the system using fₒ-based frequency normalization significantly improved over the baseline system with no normalization, with a relative improvement of up to 22.3% when the mismatch between training and testing data was large. In the continuous-speech experiment, the combination of fₒ-based frequency normalization and data augmentation yielded a relative improvement of 19.3% over the baseline. Additionally, in all experiments, the fₒ-based techniques outperformed other techniques such as vocal tract length normalization (VTLN) and vocal tract length perturbation (VTLP). Results were validated using Gaussian mixture model (GMM), deep neural network (DNN), and bidirectional long short-term memory (BLSTM) acoustic models.
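The core idea described above — preserving the Mel-scale (tonotopic) distance between a speaker's formants and their fₒ while shifting to a reference fₒ — can be sketched in a few lines. This is an illustrative reconstruction, not the paper's implementation: the function names, the choice of the O'Shaughnessy Mel formula, and the example frequencies are assumptions for demonstration only.

```python
import numpy as np

def hz_to_mel(f):
    # O'Shaughnessy Mel-scale formula (one common convention)
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    # Inverse of hz_to_mel
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def f0_warp(freqs_hz, f0_src, f0_ref):
    """Warp frequencies so their Mel-scale distance above the source
    speaker's f0 is preserved relative to a reference f0 (illustrative
    sketch of an f0-anchored warp, not the paper's exact mapping)."""
    tonotopic_dist = hz_to_mel(freqs_hz) - hz_to_mel(f0_src)
    return mel_to_hz(hz_to_mel(f0_ref) + tonotopic_dist)

# Hypothetical example: map a child's formants (f0 ~ 300 Hz) toward
# an adult reference (f0 ~ 120 Hz); formant values are illustrative.
child_formants = np.array([1000.0, 3000.0])
warped = f0_warp(child_formants, f0_src=300.0, f0_ref=120.0)
```

Warping toward a lower reference fₒ pulls the frequencies downward, which is the qualitative direction needed to reduce the child-to-adult acoustic mismatch; for normalization the same map would be applied to training features, and for augmentation it would be applied in reverse to synthesize child-like variants of adult speech.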




Updated: 2021-10-06