A computer-aided speech analytics approach for pronunciation feedback using deep feature clustering

  • Special Issue Paper
  • Multimedia Systems

Abstract

Demand for language learning is growing because people increasingly need to communicate across regions for business, study, and other purposes. Language learners make many pronunciation mistakes because of unfamiliarity with the new language and differences in accent. In this paper, we analyze speech mistakes using deep feature-based clustering. We propose two novel methods for speech analysis: one for phonemic errors (confusing phonemes) and one for prosodic errors (partially changed pronunciation variations of phones). Accurate and efficient language learning requires correcting both phonemic and prosodic errors. In our first method, we combine deep CNN features with a clustering algorithm to detect phonemic errors, and we classify the phonemes using K-nearest neighbor, Naïve Bayes, and support vector machine (SVM) classifiers. Experiments on the six most frequently mispronounced confusing pairs of Arabic phonemes achieve an accuracy of 94%. In our second method, we propose an unsupervised phone variation model (PVM) to detect prosodic errors. In the PVM, each phone is extended to represent the different types of pronunciation variation of that phone at different proficiency levels. We use an Arabic dataset of 28 individual phones for speech analysis, provide feedback based on the variation of each phone, and achieve an accuracy of 97%.
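
The general idea of clustering feature vectors and then classifying them can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract names only "a clustering algorithm" and the classifiers, so k-means with farthest-first seeding is assumed here for the clustering step, and K-nearest neighbor is used as the one classifier shown. The feature vectors are synthetic stand-ins for deep CNN features, and all function and variable names are hypothetical.

```python
import math
import random


def make_features(center, n, spread, rng):
    """Draw n synthetic vectors around a center (stand-in for CNN features)."""
    return [[c + rng.gauss(0.0, spread) for c in center] for _ in range(n)]


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def farthest_first(points, k):
    """Deterministic seeding: start at points[0], then add the farthest point."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, iters=10):
    """Plain k-means (assumed clustering step; the abstract does not name one)."""
    centroids = farthest_first(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [
            min(range(k), key=lambda j: dist(p, centroids[j])) for p in points
        ]
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign


def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k nearest training vectors."""
    nearest = sorted(zip(train_x, train_y), key=lambda t: dist(t[0], query))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)


rng = random.Random(0)
dim = 8  # hypothetical feature dimension, far smaller than real CNN features
target = make_features([1.0] * dim, 40, 0.3, rng)     # intended phoneme
confused = make_features([-1.0] * dim, 40, 0.3, rng)  # its confusing partner
features = target + confused
labels = ["target"] * 40 + ["confused"] * 40

# Unsupervised step: group the feature vectors without using the labels.
centroids, assign = kmeans(features, k=2)

# Supervised step: hold out every fourth sample and classify it with k-NN.
test_idx = set(range(0, len(features), 4))
train_x = [f for i, f in enumerate(features) if i not in test_idx]
train_y = [y for i, y in enumerate(labels) if i not in test_idx]
hits = sum(
    knn_predict(train_x, train_y, features[i]) == labels[i] for i in test_idx
)
accuracy = hits / len(test_idx)
print("cluster sizes:", [assign.count(j) for j in range(2)])
print("k-NN accuracy on held-out samples:", accuracy)
```

The synthetic classes are deliberately well separated, so the result says nothing about the accuracies reported in the abstract; it only shows how the clustering and classification steps fit together.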

Availability of data and material

The data can be provided on request.

Code availability

Not applicable.

Funding

Not applicable.

Author information

Corresponding author

Correspondence to Muazzam Maqsood.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Communicated by Muazzam Maqsood.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Nazir, F., Majeed, M.N., Ghazanfar, M.A. et al. A computer-aided speech analytics approach for pronunciation feedback using deep feature clustering. Multimedia Systems 29, 1699–1715 (2023). https://doi.org/10.1007/s00530-021-00822-5

  • DOI: https://doi.org/10.1007/s00530-021-00822-5
