End-to-End Automatic Pronunciation Error Detection Based on Improved Hybrid CTC/Attention Architecture.
Sensors (IF 3.9), Pub Date: 2020-03-25, DOI: 10.3390/s20071809
Long Zhang 1, Ziping Zhao 1, Chunmei Ma 1, Linlin Shan 2, Huazhi Sun 1, Lifen Jiang 1, Shiwen Deng 3, Chang Gao 4

Advanced automatic pronunciation error detection (APED) algorithms are usually based on state-of-the-art automatic speech recognition (ASR) techniques. With the development of deep learning technology, end-to-end ASR technology has gradually matured and achieved positive practical results, which provides us with a new opportunity to update the APED algorithm. We first constructed an end-to-end ASR system based on the hybrid connectionist temporal classification and attention (CTC/attention) architecture. An adaptive parameter was used to enhance the complementarity of the connectionist temporal classification (CTC) model and the attention-based seq2seq model, further improving the performance of the ASR system. After this, the improved ASR system was used in the APED task of Mandarin, and good results were obtained. This new APED method makes forced alignment and segmentation unnecessary, and it does not require multiple complex models, such as an acoustic model or a language model. It is convenient and straightforward, and will be a suitable general solution for L1-independent computer-assisted pronunciation training (CAPT). Furthermore, we find that, with regard to accuracy metrics, our proposed system based on the improved hybrid CTC/attention architecture is close to the state-of-the-art ASR system based on the deep neural network-deep neural network (DNN-DNN) architecture, and performs better on the F-measure metrics, which are especially suited to the requirements of the APED task.
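The core of the improved architecture is a weighted combination of the CTC objective and the attention-based seq2seq objective, with an adaptive parameter controlling the balance between the two branches. The PyTorch sketch below illustrates one plausible form of that combination; the learnable sigmoid-squashed weight `raw_lambda`, the tensor shapes, and the class name are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class HybridCTCAttentionLoss(nn.Module):
    """Minimal sketch of a hybrid CTC/attention training objective.

    The adaptive weight is modeled here as a learnable scalar squashed
    to (0, 1) with a sigmoid; the paper's adaptation scheme may differ.
    """

    def __init__(self, blank_id: int = 0):
        super().__init__()
        self.ctc_loss = nn.CTCLoss(blank=blank_id, zero_infinity=True)
        self.att_loss = nn.CrossEntropyLoss(ignore_index=-1)
        # Hypothetical learnable interpolation weight.
        self.raw_lambda = nn.Parameter(torch.zeros(1))

    def forward(self, ctc_log_probs, att_logits, targets,
                input_lengths, target_lengths, padded_targets):
        # ctc_log_probs: (T, N, C) log-probabilities from the CTC branch
        # att_logits:    (N, T_out, C) logits from the attention decoder
        # padded_targets: (N, T_out) with -1 marking padded positions
        lam = torch.sigmoid(self.raw_lambda)  # adaptive weight in (0, 1)
        l_ctc = self.ctc_loss(ctc_log_probs, targets,
                              input_lengths, target_lengths)
        l_att = self.att_loss(att_logits.transpose(1, 2), padded_targets)
        # Weighted combination of the two complementary branches.
        return lam * l_ctc + (1.0 - lam) * l_att
```

In hybrid systems of this kind, the same kind of weight is commonly reused at decoding time to interpolate CTC and attention scores, balancing the monotonic alignment enforced by CTC against the more flexible attention decoder.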

Updated: 2020-03-26