
Improving Pronunciation Erroneous Tendency Detection with Multi-Model Soft Targets

Journal of Signal Processing Systems

Abstract

Detecting pronunciation erroneous tendency (PET) can provide detailed, instructive feedback for second-language learners in computer-aided pronunciation training (CAPT). In this paper, we propose applying soft targets from various models to improve PET detection performance. First, we examine the effectiveness of soft targets in three single systems by directly replacing hard targets with soft targets for mispronunciation detection. We then propose two methods that use multi-model soft targets: 1) explicit combination, which uses a weighted linear combination of multi-model soft targets as the final targets; and 2) implicit combination, which employs a multi-task framework to combine soft targets. Experimental results show that both single-model and multi-model soft targets improve PET detection performance. Moreover, using multi-model soft targets within the multi-task framework achieves the best results on the pronunciation error detection task, and it is more efficient than conventional ensemble methods, which require multiple decoding runs or forward passes.
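The two combination strategies named in the abstract can be illustrated with a minimal sketch. The abstract does not give the model architectures, combination weights, or loss definitions, so everything below is an assumption made for illustration: the function names (`softmax`, `combine_soft_targets`, `cross_entropy`, `multitask_loss`), the use of cross-entropy against teacher posteriors, the one-output-head-per-teacher reading of the multi-task framework, and all numeric values are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn logits into a probability distribution (a soft target)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def combine_soft_targets(posteriors, weights):
    """Explicit combination (assumed form): weighted linear combination
    of per-model posteriors, used as a single final target."""
    if len(posteriors) != len(weights):
        raise ValueError("need exactly one weight per model")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    num_classes = len(posteriors[0])
    return [
        sum(w * p[k] for w, p in zip(weights, posteriors))
        for k in range(num_classes)
    ]


def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a (soft) target and a predicted distribution."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, predicted))


def multitask_loss(teacher_posteriors, student_outputs, task_weights):
    """Implicit combination (assumed form): the student has one output
    head per teacher, and the per-head losses are summed with weights."""
    return sum(
        w * cross_entropy(t, s)
        for w, t, s in zip(task_weights, teacher_posteriors, student_outputs)
    )


# Hypothetical posteriors from three teacher models for one frame.
teachers = [
    softmax([2.0, 1.0, 0.1]),
    softmax([1.5, 1.2, 0.3]),
    softmax([2.2, 0.5, 0.4]),
]

# Explicit: a single combined target for a single-output student.
combined_target = combine_soft_targets(teachers, [0.5, 0.3, 0.2])

# Implicit: one student head per teacher (hypothetical student outputs).
student_heads = [softmax([1.8, 1.1, 0.2]) for _ in teachers]
loss = multitask_loss(teachers, student_heads, [1.0, 1.0, 1.0])
```

In both variants the student is a single network, so inference needs only one forward pass, which is consistent with the efficiency claim made against conventional ensembles.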



Acknowledgements

This work is supported by the Discipline Team Support Program of Beijing Language and Culture University (Grant No. GF201906), the Advanced Innovation Center for Language Resource and Intelligence (Grant No. KYR17005), and the Special Program for Key Basic Research fund of Beijing Language and Culture University (the Fundamental Research Funds for the Central Universities, Grant No. 16ZDJ03).

Author information

Corresponding author

Correspondence to Jinsong Zhang.


About this article


Cite this article

Lin, J., Gao, Y., Zhang, W. et al. Improving Pronunciation Erroneous Tendency Detection with Multi-Model Soft Targets. J Sign Process Syst 92, 793–803 (2020). https://doi.org/10.1007/s11265-019-01485-2


  • DOI: https://doi.org/10.1007/s11265-019-01485-2
