L2 Mispronunciation Verification Based on Acoustic Phone Embedding and Siamese Networks, Journal of Signal Processing Systems


L2 Mispronunciation Verification Based on Acoustic Phone Embedding and Siamese Networks
Journal of Signal Processing Systems (IF 1.6), Pub Date: 2020-09-24, DOI: 10.1007/s11265-020-01598-z
Yanlu Xie, Zhenyu Wang, Kaiqi Fu

In computer-assisted pronunciation training (CAPT) systems, feedback on non-native mispronunciations is important because it helps second language (L2) learners improve their pronunciation. CAPT has two key research issues to address: mispronunciation verification and pronunciation evaluation. For pronunciation evaluation at the phone level, the pairwise distances between embeddings of native and non-native phones could be an ideal predictor of L2 learners' proficiency. Considering the positive role that phone embeddings and Siamese networks have played in related fields, we propose to evaluate L2 learners' pronunciation based on phone embeddings and Siamese networks. Arbitrary-length speech segments corresponding to phones can be projected into an acoustic phone embedding space as fixed-dimensional vector representations. The system input is a pair of acoustic feature vectors of phone segments, and the vectors are labeled pairwise. The Siamese networks encode the feature vectors into phone embeddings that serve as high-level representations, so each type of phone can be differentiated through the similarity of its embeddings. Based on bi-directional Long Short-Term Memory (LSTM) and a contrastive loss, the Siamese networks can be trained in a self-supervised manner using the pairwise-labeled vectors, without any mispronunciation-labeled L2 speech data in the training set. Results show that, in terms of diagnostic accuracy on mispronunciation verification tasks, the proposed networks surpass other methods and achieve an accuracy as high as 90.69%.
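The contrastive objective and the distance-based decision described in the abstract can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not give the loss formula, the distance metric, the margin, or a decision threshold, so the code uses a commonly used form of contrastive loss with Euclidean distance, and the names `contrastive_loss`, `is_mispronounced`, `margin`, and `threshold` are hypothetical. The embeddings are assumed to have already been produced by an encoder, which in the paper is a bi-directional LSTM.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(emb_a, emb_b, same_phone, margin=1.0):
    """A common form of contrastive loss for one labeled pair (assumed form).

    same_phone=True  -> the loss grows with distance, pulling the pair together.
    same_phone=False -> the loss is non-zero only while the pair is closer
                        than `margin`, pushing the pair apart.
    The margin value is an assumption, not taken from the paper.
    """
    d = euclidean_distance(emb_a, emb_b)
    if same_phone:
        return d ** 2
    return max(0.0, margin - d) ** 2


def is_mispronounced(learner_emb, native_emb, threshold=0.5):
    """Hypothetical decision rule: flag the learner's phone when its embedding
    lies farther than `threshold` from a native reference embedding."""
    return euclidean_distance(learner_emb, native_emb) > threshold


if __name__ == "__main__":
    native = [0.0, 0.0]
    learner = [0.0, 0.5]
    print(contrastive_loss(native, learner, same_phone=True))
    print(contrastive_loss(native, learner, same_phone=False))
    print(is_mispronounced(learner, native, threshold=0.4))
```

Because each training pair only needs a same-phone or different-phone label, a loss of this kind can be computed without any mispronunciation annotations, which is consistent with the training setup the abstract describes.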


Updated: 2020-09-24
