Gain-optimized spectral distortions for pronunciation training

  • Original Paper
  • Optimization Letters

Abstract

This paper considers an assessment and evaluation of speech sound pronunciation quality in computer-aided language learning systems. We examine the gain optimization of spectral distortion measures between the speech signals of a native speaker and a learner. During training, a learner has to achieve stable pronunciation of all sounds. This is measured by computing the distances between the sounds produced by the learner and the model speaker. In order to improve pronunciation, it is proposed to adapt the linear prediction coding coefficients of reference sounds by using the gradient descent optimization of the gain-optimized dissimilarity. As a result, we demonstrate the possibility of synthesizing sounds that will be either close to the model pronunciation or achievable by a learner. An experimental study shows that the proposed procedure leads to high efficiency for pronunciation training even in the presence of noise in the observed utterance.
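
To make the idea of a gain-optimized spectral distortion concrete, the following is a minimal sketch, not the authors' implementation. It assumes the classical gain-optimized Itakura-Saito distortion (Gray et al., 1980) between two all-pole (LPC) power spectra; the exact measure used in the paper is not visible in this preview and may differ (for example, it may be symmetrized). The function names, the frequency grid, and the example coefficients are illustrative assumptions.

```python
import math


def ar_power_spectrum(a, n_bins=64, gain=1.0):
    """Power spectrum gain / |A(e^{jw})|^2 of an all-pole (LPC) model,
    with A(z) = 1 + a[0] z^-1 + a[1] z^-2 + ..., on n_bins points in [0, pi)."""
    spectrum = []
    for i in range(n_bins):
        w = math.pi * i / n_bins
        re = 1.0 + sum(ak * math.cos(w * (k + 1)) for k, ak in enumerate(a))
        im = -sum(ak * math.sin(w * (k + 1)) for k, ak in enumerate(a))
        spectrum.append(gain / (re * re + im * im))
    return spectrum


def itakura_saito(p, q):
    """Plain Itakura-Saito distortion: mean of r - ln(r) - 1 with r = p / q."""
    ratios = [x / y for x, y in zip(p, q)]
    return sum(r - math.log(r) - 1.0 for r in ratios) / len(ratios)


def gain_optimized_is(p, q):
    """Minimum over g > 0 of itakura_saito(p, g * q).

    Setting the derivative with respect to g to zero gives g = mean(p / q),
    and the minimum equals ln(mean(p / q)) - mean(ln(p / q)).
    Returns (distortion, optimal_gain).
    """
    ratios = [x / y for x, y in zip(p, q)]
    mean_ratio = sum(ratios) / len(ratios)
    mean_log = sum(math.log(r) for r in ratios) / len(ratios)
    return math.log(mean_ratio) - mean_log, mean_ratio


# Hypothetical LPC coefficients for a "model" sound and a "learner" sound.
model = ar_power_spectrum([-0.9], gain=2.0)
learner = ar_power_spectrum([-0.5, 0.2], gain=0.3)

distortion, best_gain = gain_optimized_is(model, learner)
print("gain-optimized distortion:", distortion)
print("optimal gain:", best_gain)
```

Because the gain is optimized out, rescaling either spectrum (for instance, a louder or quieter recording) leaves the distortion unchanged, so the score reflects only the spectral shape of the sound.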

Data availability

The datasets generated and/or analyzed during the current study are available from the corresponding author on reasonable request. A subset (speech-data.zip) for two speakers, suitable for reproducing most of our experiments, is publicly available (https://drive.google.com/drive/folders/1bk1VGNP4fPGwPckCcC-5BvHXsGYgwtUx).

Notes

  1. https://drive.google.com/drive/folders/1bk1VGNP4fPGwPckCcC-5BvHXsGYgwtUx.

Author information

Corresponding author

Correspondence to Andrey V. Savchenko.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Section 4 was prepared within the framework of the Basic Research Program at the National Research University Higher School of Economics (HSE). The remaining work is supported by RSF (Russian Science Foundation) Grant 20-71-10010.

About this article

Cite this article

Savchenko, A.V., Savchenko, V.V. & Savchenko, L.V. Gain-optimized spectral distortions for pronunciation training. Optim Lett 16, 2095–2113 (2022). https://doi.org/10.1007/s11590-021-01790-5

  • DOI: https://doi.org/10.1007/s11590-021-01790-5