End-to-end acoustic modelling for phone recognition of young readers
Speech Communication (IF 3.2), Pub Date: 2021-09-22, DOI: 10.1016/j.specom.2021.08.003
Lucile Gelin 1,2, Morgane Daniel 2, Julien Pinquier 1, Thomas Pellegrini 1

Automatic recognition systems for child speech lag behind those dedicated to adult speech in terms of performance. This gap is due to the high acoustic and linguistic variability of child speech, caused by children's physical development, as well as the scarcity of available child speech data. Young readers' speech additionally displays peculiarities, such as a slow reading rate and the presence of reading mistakes, that make the task harder. This work tackles the main challenges of phone acoustic modelling for young child speech with limited data and aims to improve understanding of the strengths and weaknesses of a wide selection of model architectures in this domain. We find that transfer learning techniques are highly effective on end-to-end architectures for adult-to-child adaptation with a small amount of child speech data. Through transfer learning, a Transformer model complemented with a Connectionist Temporal Classification (CTC) objective function reaches a phone error rate of 28.1%, outperforming a state-of-the-art DNN–HMM model by 6.6% relative and other end-to-end architectures by more than 8.5% relative. An analysis of the models' performance on two specific reading tasks (isolated words and sentences) is provided, showing the influence of utterance length on attention-based and CTC-based models. The Transformer+CTC model better detects reading mistakes made by children, which can be attributed to the CTC objective function effectively constraining the attention mechanisms to be monotonic.
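To make the described setup concrete, below is a minimal sketch, not the authors' implementation, of a Transformer encoder trained with a CTC objective for phone recognition, together with adult-to-child transfer learning by fine-tuning on a small child-speech set. It assumes PyTorch; the model sizes, feature dimension, phone-inventory size, and the checkpoint path `adult_pretrained.pt` are illustrative placeholders.

```python
# Sketch only: a CTC-trained Transformer encoder for phone recognition,
# fine-tuned from an adult-speech model (adult-to-child transfer learning).
import torch
import torch.nn as nn

class TransformerCTCPhoneRecognizer(nn.Module):
    def __init__(self, n_feats=80, d_model=256, n_heads=4, n_layers=6, n_phones=40):
        super().__init__()
        self.input_proj = nn.Linear(n_feats, d_model)       # acoustic features -> model dim
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           dim_feedforward=1024, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.output = nn.Linear(d_model, n_phones + 1)       # +1 for the CTC blank symbol

    def forward(self, feats):                                # feats: (batch, time, n_feats)
        x = self.input_proj(feats)
        x = self.encoder(x)
        return self.output(x).log_softmax(dim=-1)            # (batch, time, n_phones + 1)

# CTC training step; the monotonic alignment enforced by CTC is what the
# abstract credits for better detection of children's reading mistakes.
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

def train_step(model, optimizer, feats, feat_lens, phone_targets, target_lens):
    log_probs = model(feats).transpose(0, 1)                 # CTCLoss expects (time, batch, classes)
    loss = ctc_loss(log_probs, phone_targets, feat_lens, target_lens)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Adult-to-child transfer learning, sketched: initialise from a model trained
# on adult speech, then fine-tune on the small child-speech corpus with a
# reduced learning rate (hypothetical checkpoint path below).
model = TransformerCTCPhoneRecognizer()
model.load_state_dict(torch.load("adult_pretrained.pt"))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```

This is only one plausible instantiation; the paper compares several end-to-end architectures (attention-based and CTC-based), and freezing lower layers during fine-tuning is another common transfer-learning option.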




Updated: 2021-09-27