Abstract
Detecting pronunciation erroneous tendency (PET) can provide detailed, instructive feedback for second-language learners in computer-aided pronunciation training (CAPT). In this paper, we propose applying soft targets from various models to improve PET detection performance. First, we examine the effectiveness of soft targets in three single systems by directly replacing hard targets with soft targets for mispronunciation detection. We then propose two methods that use multi-model soft targets: 1) explicit combination, which uses a weighted linear combination of the multi-model soft targets as the final targets; and 2) implicit combination, which employs a multi-task framework to combine the soft targets. Experimental results show that both single soft targets and multi-model soft targets improve PET detection performance. Moreover, using multi-model soft targets within a multi-task framework achieves the best results on the pronunciation error detection task, and it is more efficient than conventional ensemble methods, which require multiple decoding runs or forward passes.
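The two combination strategies named in the abstract can be illustrated with a minimal sketch. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the weights, temperatures, or network architectures, so every function name, weight, and logit value below is hypothetical. Explicit combination is shown as a normalized weighted sum of per-model posteriors; implicit combination is shown as a multi-task loss in which each output head is trained against one model's soft targets.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert raw scores into a probability distribution (a soft target).

    The temperature parameter is an assumption borrowed from knowledge
    distillation; the abstract does not say whether one is used.
    """
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def combine_soft_targets(targets, weights):
    """Explicit combination: weighted linear combination of soft targets.

    Weights are normalized so the result is still a distribution.
    """
    if len(targets) != len(weights):
        raise ValueError("need exactly one weight per set of soft targets")
    norm = sum(weights)
    n_classes = len(targets[0])
    return [
        sum(w * t[k] for w, t in zip(weights, targets)) / norm
        for k in range(n_classes)
    ]


def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a (soft) target and a predicted distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))


def multitask_loss(targets, predictions, task_weights):
    """Implicit combination: one output head per source of soft targets.

    Each head is scored against its own soft targets and the per-task
    losses are added with hypothetical task weights.
    """
    return sum(
        w * cross_entropy(t, p)
        for w, t, p in zip(task_weights, targets, predictions)
    )


# Hypothetical posteriors from two teacher models over three classes.
teacher_a = softmax([2.0, 0.5, -1.0])
teacher_b = softmax([1.0, 1.5, -0.5])

# Explicit combination: a single merged target for one student output.
combined = combine_soft_targets([teacher_a, teacher_b], [0.6, 0.4])

# Implicit combination: two student heads, one per teacher.
student_heads = [softmax([1.5, 0.7, -0.8]), softmax([0.9, 1.2, -0.4])]
loss = multitask_loss([teacher_a, teacher_b], student_heads, [0.5, 0.5])
```

In both variants the student is decoded with a single forward pass at test time, which is the efficiency argument the abstract makes against conventional ensembles.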
Acknowledgements
This work is supported by the Discipline Team Support Program of Beijing Language and Culture University (Grant No. GF201906), the Advanced Innovation Center for Language Resource and Intelligence (Grant No. KYR17005), and the Special Program for Key Basic Research fund of Beijing Language and Culture University (the Fundamental Research Funds for the Central Universities, Grant No. 16ZDJ03).
Cite this article
Lin, J., Gao, Y., Zhang, W. et al. Improving Pronunciation Erroneous Tendency Detection with Multi-Model Soft Targets. J Sign Process Syst 92, 793–803 (2020). https://doi.org/10.1007/s11265-019-01485-2
DOI: https://doi.org/10.1007/s11265-019-01485-2