Abstract
Automatic speech recognition (ASR) aims to automate natural speech perception and processing by analysing the linguistic and acoustic features of the speech signal. ASR for children is highly challenging because of their developing physiology and rapidly changing articulation features. As a result, ASR for children is still in its infancy. In this work, a stacked multilayer auto-encoder (AE) network is designed for ASR of Malayalam vowels articulated by children aged five to ten. The proposed network is structured as unsupervised pre-training followed by supervised training. Pre-training uses two layers of sparse auto-encoders, with the scaled conjugate gradient (SCG) algorithm used for back-propagation. The auto-encoders pre-train the network in an unsupervised (self-supervised) manner with 40,500 features that include Mel frequency cepstral coefficients (MFCC) and their derivatives, spectrogram formants and zero crossing rate (ZCR). In the softmax layer, the pre-trained network is retrained in a supervised manner on the bottleneck features. Fine-tuning is then applied to the trained network to improve its performance. The unsupervised and supervised layers are stacked together to form a single network. The designed network achieves an average accuracy of 97% on the training data and 89.5% on the test data set.
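The stacked structure described above (two encoder layers feeding a softmax layer) can be illustrated with a minimal forward-pass sketch. This is not the authors' implementation: the layer sizes, the logistic activation, the number of output classes and the random weights are all illustrative assumptions, and the training steps (sparse auto-encoder pre-training, SCG back-propagation, fine-tuning) are omitted.

```python
import math
import random


def sigmoid(x):
    """Logistic activation, assumed here for both encoder layers."""
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases):
    """Affine map: one output per row of `weights`."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_in, n_out, rng):
    """Small random weights; a stand-in for pre-trained parameters."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def stacked_forward(features, enc1, enc2, clf):
    """Encoder 1 -> encoder 2 (bottleneck) -> softmax classifier."""
    h1 = [sigmoid(z) for z in dense(features, *enc1)]
    h2 = [sigmoid(z) for z in dense(h1, *enc2)]
    return softmax(dense(h2, *clf))


rng = random.Random(0)
# Hypothetical sizes chosen for readability, not taken from the paper.
N_FEATURES, N_HIDDEN1, N_HIDDEN2, N_CLASSES = 20, 8, 4, 5
enc1 = init_layer(N_FEATURES, N_HIDDEN1, rng)
enc2 = init_layer(N_HIDDEN1, N_HIDDEN2, rng)
clf = init_layer(N_HIDDEN2, N_CLASSES, rng)

features = [rng.random() for _ in range(N_FEATURES)]
probs = stacked_forward(features, enc1, enc2, clf)
predicted_class = max(range(N_CLASSES), key=probs.__getitem__)
```

In a trained system, `enc1` and `enc2` would hold the encoder weights learned during unsupervised pre-training, and `clf` the softmax weights learned on the bottleneck output `h2`; here they are random, so `predicted_class` carries no meaning beyond showing the data flow.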
Abbreviations
- ASR: Automatic speech recognition
- AE: Auto-encoder
- MFCC: Mel frequency cepstral coefficients
- SCG: Scaled conjugate gradient
- ZCR: Zero crossing rate
- HMM: Hidden Markov model
- ANNs: Artificial neural networks
- DBN: Deep belief network
- RBM: Restricted Boltzmann machine
- MOM: Method of moments
Cite this article
Pillai, L.G., Mubarak, D.M.N. A stacked auto-encoder with scaled conjugate gradient algorithm for Malayalam ASR. Int. j. inf. tecnol. 13, 1473–1479 (2021). https://doi.org/10.1007/s41870-020-00573-y
DOI: https://doi.org/10.1007/s41870-020-00573-y