Abstract
This paper proposes a model to improve monophone-based connected-word speech recognition for Hindi using hidden Markov models (HMMs). The model combines hybrid subword units with domain-specific syntactic structures. The hybrid units comprise both phoneme-based and syllable-based subword units. Because syllable-based units cover a larger acoustic span, they reduce contextual effects. In the hybrid model, syllable-based acoustic units are used only for nasal sounds, with the aim of improving their recognition score. A further improvement comes from using syntactic structures in the grammar definition during recognition: domain-specific structures reduce the recognizer's search space and consequently improve system performance. Two grammar definitions were applied: gram1, with no restrictions, and gram2, with domain-specific structures. The recognition framework was implemented with the HMM-based toolkit HTK using five-state HMMs. A self-created connected-word speech dataset with a vocabulary of 240 Hindi words was used. Mel frequency cepstral coefficients (MFCCs), MFCCs with energy (MFCC_E), and perceptual linear prediction coefficients with energy (PLP_E) were used for feature extraction. Monophones were trained with and without silence fixing to check the impact of short pauses on the recognizer's performance. The system was tested in both speaker-dependent and speaker-independent modes. The hybrid model combined with gram2 and silence fixing gave the best results. Using MFCCs, gram2, phoneme-based modelling, and silence fixing, the system obtained an overall word accuracy of 80.28%, word correct of 80.28%, and a word error rate of 19.72%. Using PLP_E coefficients, the hybrid model, silence fixing, and gram2, it obtained an overall word accuracy of 88.54%, word correct of 88.54%, and a word error rate of 11.46%.
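The three reported measures can be related through the usual word-level counts of hits (H), substitutions (S), deletions (D), and insertions (I). The following is a minimal Python sketch, not the authors' scoring code. It assumes the conventional definitions (word correct = H/N, word accuracy = (H - I)/N, word error rate = (S + D + I)/N, with N = H + S + D reference words) and a unit-cost edit-distance alignment, which may differ from the alignment penalties used by HTK's own scoring tool. The function names and the romanized word sequences are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def align_counts(ref: List[str], hyp: List[str]) -> Tuple[int, int, int, int]:
    """Return (hits, substitutions, deletions, insertions) for one
    minimum-edit-distance alignment of hyp against ref (unit costs assumed)."""
    n, m = len(ref), len(hyp)
    # Each cell holds (distance, H, S, D, I) for ref[:i] versus hyp[:j].
    table = [[(0, 0, 0, 0, 0)] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        table[i][0] = (i, 0, 0, i, 0)   # only deletions
    for j in range(1, m + 1):
        table[0][j] = (j, 0, 0, 0, j)   # only insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist, h, s, d, ins = table[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (dist, h + 1, s, d, ins)        # hit
            else:
                diag = (dist + 1, h, s + 1, d, ins)    # substitution
            dist, h, s, d, ins = table[i - 1][j]
            deletion = (dist + 1, h, s, d + 1, ins)
            dist, h, s, d, ins = table[i][j - 1]
            insertion = (dist + 1, h, s, d, ins + 1)
            table[i][j] = min(diag, deletion, insertion, key=lambda t: t[0])
    _, h, s, d, ins = table[n][m]
    return h, s, d, ins


def scores(h: int, s: int, d: int, ins: int) -> Tuple[float, float, float]:
    """Return (word correct %, word accuracy %, word error rate %)."""
    n = h + s + d  # number of reference words
    correct = 100.0 * h / n
    accuracy = 100.0 * (h - ins) / n
    wer = 100.0 * (s + d + ins) / n
    return correct, accuracy, wer


# Hypothetical reference and recognizer output (illustrative only).
reference = "ek do teen char paanch".split()
hypothesis = "ek do tin char".split()
counts = align_counts(reference, hypothesis)
print(counts)           # (3, 1, 1, 0)
print(scores(*counts))  # (60.0, 60.0, 40.0)
```

Under these definitions, word accuracy and word error rate always sum to 100%, and word correct equals word accuracy exactly when no insertions are counted, which is consistent with the figures reported above.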
Acknowledgement
The authors would like to acknowledge the Ministry of Electronics and Information Technology (MeitY), Government of India, for providing financial assistance for this research work through “Visvesvaraya PhD Scheme for Electronics and IT”.
Cite this article
BHATT, S., JAIN, A. & DEV, A. Monophone-based connected word Hindi speech recognition improvement. Sādhanā 46, 99 (2021). https://doi.org/10.1007/s12046-021-01614-3