Improving Pronunciation Erroneous Tendency Detection with Multi-Model Soft Targets

Lin, Ju; Gao, Yingming; Zhang, Wei; Wei, Linxuan; Xie, Yanlu; Zhang, Jinsong

doi:10.1007/s11265-019-01485-2

Published: 24 March 2020

Volume 92, pages 793–803 (2020)

Journal of Signal Processing Systems

Ju Lin (ORCID: orcid.org/0000-0002-6970-4247)^1,2,
Yingming Gao^3,
Wei Zhang^4,
Linxuan Wei^4,
Yanlu Xie^1,4 &
Jinsong Zhang^1,4

Abstract

Detecting pronunciation erroneous tendency (PET) can provide detailed, instructive feedback to second-language learners in computer-aided pronunciation training (CAPT). In this paper, we proposed applying soft targets from various models to improve PET detection performance. First, we examined the effectiveness of soft targets in three single systems by directly replacing hard targets with soft targets for mispronunciation detection. We then proposed two methods that use multi-model soft targets: 1) explicit combination, which used a weighted linear combination of the multi-model soft targets as the final targets; and 2) implicit combination, which employed a multi-task framework to combine the soft targets. Experimental results showed that PET detection performance could be improved by using both single-model soft targets and multi-model soft targets. Moreover, using multi-model soft targets within the multi-task framework achieved the best results on the pronunciation error detection task, and it was more efficient than conventional ensemble methods, which require multiple decoding runs or forward passes.
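
The two combination schemes named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the cross-entropy loss, the optional softmax temperature, and the choice of one student output head per teacher in the multi-task case are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Row-wise softmax; the temperature parameter is an assumption."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(targets, probs, eps=1e-12):
    """Mean cross-entropy between target and predicted distributions."""
    return float(-(targets * np.log(probs + eps)).sum(axis=-1).mean())

def explicit_combination(teacher_posteriors, weights):
    """Explicit combination: weighted linear mix of per-teacher soft targets.

    teacher_posteriors: shape (n_teachers, n_frames, n_classes)
    weights: shape (n_teachers,), expected to sum to 1
    """
    weights = np.asarray(weights, dtype=float)
    if not np.isclose(weights.sum(), 1.0):
        raise ValueError("combination weights should sum to 1")
    # Contract the teacher axis: result has shape (n_frames, n_classes).
    return np.tensordot(weights, teacher_posteriors, axes=1)

def multitask_loss(student_head_probs, teacher_posteriors, task_weights):
    """Implicit combination (hypothetical form): one student output head
    per teacher, each trained on that teacher's soft targets, with the
    per-head losses summed under task weights."""
    losses = [cross_entropy(t, p)
              for t, p in zip(teacher_posteriors, student_head_probs)]
    return float(np.dot(task_weights, losses))

# Toy data: 3 teacher models, 4 frames, 5 output classes (all hypothetical).
rng = np.random.default_rng(0)
teachers = np.stack([softmax(rng.normal(size=(4, 5))) for _ in range(3)])
student_heads = np.stack([softmax(rng.normal(size=(4, 5))) for _ in range(3)])

combined = explicit_combination(teachers, [0.5, 0.3, 0.2])
loss = multitask_loss(student_heads, teachers, [0.5, 0.3, 0.2])
```

In either scheme a single student network is trained, so only one forward pass is needed at test time, which is the efficiency advantage over conventional ensembles that the abstract points to.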

Acknowledgements

This work is supported by the Discipline Team Support Program of Beijing Language and Culture University (Grant No. GF201906), the Advanced Innovation Center for Language Resource and Intelligence (Grant No. KYR17005), and the Special Program for Key Basic Research fund of Beijing Language and Culture University (the Fundamental Research Funds for the Central Universities, Grant No. 16ZDJ03).

Author information

Authors and Affiliations

Beijing Advanced Innovation Center for Language Resources, Beijing, China
Ju Lin, Yanlu Xie & Jinsong Zhang
Clemson University, Clemson, SC, 29634, USA
Ju Lin
Dresden University of Technology, Dresden, Germany
Yingming Gao
Beijing Language and Culture University, Beijing, China
Wei Zhang, Linxuan Wei, Yanlu Xie & Jinsong Zhang

Corresponding author

Correspondence to Jinsong Zhang.

About this article

Cite this article

Lin, J., Gao, Y., Zhang, W. et al. Improving Pronunciation Erroneous Tendency Detection with Multi-Model Soft Targets. J Sign Process Syst 92, 793–803 (2020). https://doi.org/10.1007/s11265-019-01485-2

Received: 14 February 2019
Revised: 18 June 2019
Accepted: 11 September 2019
Published: 24 March 2020
Issue Date: August 2020
DOI: https://doi.org/10.1007/s11265-019-01485-2
