Abstract
This paper considers the assessment of speech sound pronunciation quality in computer-aided language learning systems. We examine gain optimization of spectral distortion measures between the speech signals of a native speaker and a learner. During training, a learner has to achieve stable pronunciation of all sounds, which is measured by computing the distances between the sounds produced by the learner and those of the model speaker. To improve pronunciation, we propose adapting the linear prediction coding coefficients of the reference sounds by gradient descent on the gain-optimized dissimilarity. As a result, we demonstrate the possibility of synthesizing sounds that are either close to the model pronunciation or achievable by the learner. An experimental study shows that the proposed procedure is highly efficient for pronunciation training, even when the observed utterance is noisy.
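The idea summarized above can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not confirm: the spectral distortion is taken to be the Itakura-Saito measure, whose minimum over a gain factor has the closed form ln(mean(r)) - mean(ln(r)) for the spectral ratio r; the spectra are unit-gain all-pole (LPC) spectra on a uniform frequency grid; and the gradient is approximated by central finite differences. The function names, step size, and coefficient values are hypothetical, and the sketch is not the authors' procedure.

```python
import numpy as np

def lpc_power_spectrum(a, n_freq=256):
    """Unit-gain all-pole power spectrum 1/|A(e^{jw})|^2 on [0, pi],
    with A(z) = 1 + a[0] z^-1 + ... + a[p-1] z^-p."""
    w = np.linspace(0.0, np.pi, n_freq)
    k = np.arange(1, len(a) + 1)
    A = 1.0 + np.exp(-1j * np.outer(w, k)) @ np.asarray(a, dtype=float)
    return 1.0 / np.abs(A) ** 2

def gain_optimized_is(p, q):
    """Minimum over gain g > 0 of the Itakura-Saito distortion d_IS(p, g*q).
    The minimum is reached at g = mean(p/q)."""
    r = p / q
    return np.log(np.mean(r)) - np.mean(np.log(r))

def adapt_reference(a_ref, a_learner, step=0.02, n_iter=300, eps=1e-5):
    """Move reference LPC coefficients toward the learner's spectrum by
    gradient descent on the gain-optimized distortion (assumed setup).
    Filter stability is not enforced in this sketch."""
    a = np.array(a_ref, dtype=float)
    p_learner = lpc_power_spectrum(a_learner)
    loss = lambda c: gain_optimized_is(p_learner, lpc_power_spectrum(c))
    history = [loss(a)]
    for _ in range(n_iter):
        grad = np.zeros_like(a)
        for i in range(len(a)):
            d = np.zeros_like(a)
            d[i] = eps
            # Central finite difference stands in for an analytic gradient.
            grad[i] = (loss(a + d) - loss(a - d)) / (2 * eps)
        a -= step * grad
        history.append(loss(a))
    return a, history

if __name__ == "__main__":
    # Hypothetical second-order coefficients for a reference and a learner sound.
    adapted, history = adapt_reference([-0.9, 0.4], [-0.6, 0.3])
    print("initial distortion:", history[0])
    print("final distortion:  ", history[-1])
    print("adapted coefficients:", adapted)
```

Stopping the descent early yields coefficients between the reference and the learner, which is one way to read the abstract's notion of a sound that is achievable by the learner. Because the distortion is minimized over the gain, rescaling either spectrum by a constant leaves the value unchanged.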
Data availability
The datasets generated and/or analyzed during the current study are available from the corresponding author on reasonable request. A subset (speech-data.zip) for two speakers, suitable for reproducing most of our experiments, is publicly available (https://drive.google.com/drive/folders/1bk1VGNP4fPGwPckCcC-5BvHXsGYgwtUx).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Section 4 was prepared within the framework of the Basic Research Program at the National Research University Higher School of Economics (HSE). The remaining work is supported by RSF (Russian Science Foundation) Grant 20-71-10010.
About this article
Cite this article
Savchenko, A.V., Savchenko, V.V. & Savchenko, L.V. Gain-optimized spectral distortions for pronunciation training. Optim Lett 16, 2095–2113 (2022). https://doi.org/10.1007/s11590-021-01790-5
DOI: https://doi.org/10.1007/s11590-021-01790-5