A computer-aided speech analytics approach for pronunciation feedback using deep feature clustering

Nazir, Faria; Majeed, Muhammad Nadeem; Ghazanfar, Mustansar Ali; Maqsood, Muazzam

doi:10.1007/s00530-021-00822-5

Special Issue Paper
Published: 19 July 2021

Volume 29, pages 1699–1715, (2023)

Multimedia Systems

Faria Nazir¹,
Muhammad Nadeem Majeed²,
Mustansar Ali Ghazanfar³ &
…
Muazzam Maqsood ORCID: orcid.org/0000-0002-2709-0849⁴

Abstract

Demand for language learning is increasing because people need to communicate across regions for business, study, and other purposes. Learners make many pronunciation mistakes because they are unfamiliar with the new language and because accents differ. In this paper, we analyze speech mistakes using deep feature-based clustering. We propose two novel methods for speech analysis: one for phonemic errors (confusing phonemes) and one for prosodic errors (partially changed pronunciation variations of phones). Accurate and efficient language learning requires correcting both phonemic and prosodic errors. In the first method, we combine deep CNN features with a clustering algorithm to detect phonemic errors, and we classify the phonemes using K-nearest neighbor, Naïve Bayes, and support vector machine (SVM) classifiers. Experiments on the six most frequently mispronounced confusing pairs of Arabic phonemes achieve an accuracy of 94%. In the second method, we propose an unsupervised phone variation model (PVM) to detect prosodic errors. In PVM, each phone is extended to represent the different pronunciation variations of that phone at different proficiency levels. We use an Arabic dataset of 28 individual phones for speech analysis, provide feedback based on the variation of each phone, and achieve an accuracy of 97%.
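As a rough illustration of the first method, the sketch below clusters feature vectors and then classifies them with a K-nearest-neighbor vote. It is not the authors' implementation: the abstract does not specify the CNN, the clustering algorithm, or how the two are combined, so the sketch assumes plain k-means and uses synthetic Gaussian vectors as a stand-in for deep CNN features of one confusing phoneme pair. All function names (`make_features`, `kmeans`, `knn_predict`, `run_demo`) and all parameters are hypothetical.

```python
import math
import random


def make_features(center, n, rng, spread=0.3):
    # Synthetic stand-in for deep CNN feature vectors of one phoneme (assumption).
    return [[c + rng.gauss(0.0, spread) for c in center] for _ in range(n)]


def dist(a, b):
    # Euclidean distance between two feature vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign(points, centroids):
    # Index of the nearest centroid for every point.
    return [min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            for p in points]


def kmeans(points, k, iters=20):
    # Lloyd's k-means with farthest-point initialisation (assumed algorithm).
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(dist(p, c) for c in centroids))
        centroids.append(list(far))
    for _ in range(iters):
        labels = assign(points, centroids)
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign(points, centroids)


def knn_predict(train_x, train_y, x, k=3):
    # Majority vote among the k nearest training vectors.
    nearest = sorted(zip(train_x, train_y), key=lambda t: dist(t[0], x))[:k]
    votes = [y for _, y in nearest]
    return max(set(votes), key=votes.count)


def run_demo(seed=0):
    rng = random.Random(seed)
    a = make_features([0.0] * 4, 40, rng)  # phoneme "A" of a confusing pair
    b = make_features([3.0] * 4, 40, rng)  # phoneme "B" of the same pair
    train_x, train_y = a[:30] + b[:30], ["A"] * 30 + ["B"] * 30
    test_x, test_y = a[30:] + b[30:], ["A"] * 10 + ["B"] * 10

    # Cluster purity: fraction of vectors that share their cluster's majority label.
    _, clusters = kmeans(train_x, k=2)
    purity = sum(
        max(sum(1 for c, y in zip(clusters, train_y) if c == j and y == lab)
            for lab in ("A", "B"))
        for j in range(2)
    ) / len(train_y)

    preds = [knn_predict(train_x, train_y, x) for x in test_x]
    accuracy = sum(p == y for p, y in zip(preds, test_y)) / len(test_y)
    return purity, accuracy


if __name__ == "__main__":
    print(run_demo())
```

The synthetic classes are deliberately well separated, so the numbers this toy produces say nothing about the accuracies reported in the abstract. With real data, the feature vectors would come from a trained CNN, and the Naïve Bayes and SVM classifiers named in the abstract could be compared against the same nearest-neighbor baseline.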

Availability of data and material

The data can be provided on request.

Code availability

Not applicable.

Funding

Not applicable.

Author information

Authors and Affiliations

Department of Software Engineering, University of Engineering and Technology Taxila, Taxila, Pakistan
Faria Nazir
Department of Data Science, University of the Punjab, Lahore, Pakistan
Muhammad Nadeem Majeed
School of Architecture, Computing and Engineering, University of East London, London, UK
Mustansar Ali Ghazanfar
Department of Computer Science, COMSATS University Islamabad, Attock Campus, Pakistan
Muazzam Maqsood

Corresponding author

Correspondence to Muazzam Maqsood.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Communicated by Muazzam Maqsood.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Nazir, F., Majeed, M.N., Ghazanfar, M.A. et al. A computer-aided speech analytics approach for pronunciation feedback using deep feature clustering. Multimedia Systems 29, 1699–1715 (2023). https://doi.org/10.1007/s00530-021-00822-5

Received: 22 June 2020
Accepted: 09 June 2021
Published: 19 July 2021
Issue Date: June 2023
DOI: https://doi.org/10.1007/s00530-021-00822-5
