End-to-end acoustic modelling for phone recognition of young readers
arXiv - CS - Sound Pub Date : 2021-03-04 , DOI: arxiv-2103.02899
Lucile Gelin, Morgane Daniel, Julien Pinquier, Thomas Pellegrini

Automatic speech recognition systems for child speech lag behind those dedicated to adult speech in terms of performance. This gap stems from the high acoustic and linguistic variability of child speech, caused by children's ongoing body development, as well as from the scarcity of available child speech data. Young readers' speech additionally displays peculiarities, such as a slow reading rate and the presence of reading mistakes, which make the task harder. This work attempts to tackle the main challenges in phone acoustic modelling for young child speech with limited data, and to improve understanding of the strengths and weaknesses of a wide selection of model architectures in this domain. We find that transfer learning techniques are highly effective on end-to-end architectures for adult-to-child adaptation with a small amount of child speech data. Through transfer learning, a Transformer model complemented with a Connectionist Temporal Classification (CTC) objective function reaches a phone error rate of 28.1%, outperforming a state-of-the-art DNN-HMM model by 6.6% relative, and other end-to-end architectures by more than 8.5% relative. An analysis of the models' performance on two specific reading tasks (isolated words and sentences) shows the influence of utterance length on attention-based and CTC-based models. The Transformer+CTC model is better at detecting the reading mistakes made by children, which can be attributed to the CTC objective function effectively constraining the attention mechanisms to be monotonic.
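The phone error rate (PER) reported above is the standard edit-distance metric: the total number of phone substitutions, insertions, and deletions between hypothesis and reference, divided by the total number of reference phones. A minimal sketch of that computation (the phone labels and scoring helper below are illustrative, not from the paper):

```python
def levenshtein(ref, hyp):
    """Edit distance between two phone sequences, via a one-row DP table."""
    n = len(hyp)
    dp = list(range(n + 1))  # dp[j] = distance for empty ref prefix
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                       # deletion
                        dp[j - 1] + 1,                   # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution/match
            prev = cur
    return dp[n]

def phone_error_rate(refs, hyps):
    """PER = total edit distance / total number of reference phones."""
    edits = sum(levenshtein(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return edits / total

# Hypothetical example: one substitution (b -> p) and one deletion (final sil)
ref = ["sil", "b", "a", "t", "sil"]
hyp = ["sil", "p", "a", "t"]
print(phone_error_rate([ref], [hyp]))  # 2 errors / 5 reference phones = 0.4
```

Relative improvements such as the 6.6% figure quoted above are then ratios of PERs, e.g. (baseline PER - new PER) / baseline PER.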

Updated: 2021-03-05