Sichuan dialect speech recognition with deep LSTM network

  • Research Article
  • Frontiers of Computer Science

Abstract

Because languages vary so widely, a separate speech recognition system has to be built for each one. Dialects are especially challenging: they contain many special words and colloquial features, and dialect speech data is very scarce, so constructing a dialect speech recognition system is difficult. This paper constructs a speech recognition system for Sichuan dialect by combining a hidden Markov model (HMM) with a deep long short-term memory (LSTM) network. We created a Sichuan dialect dataset and implemented an HMM-LSTM speech recognition system on it. Unlike a deep neural network (DNN), which captures context from only a fixed number of input items, the LSTM network is not restricted to a fixed context. Moreover, to accurately recognize polyphonic characters and vocabulary with special pronunciations in Sichuan dialect, we collect all the characters in the dataset together with their common phoneme sequences to form a lexicon. The system achieves an 11.34% character error rate on the Sichuan dialect evaluation dataset. To our knowledge, this is the best performance reported on this corpus to date.
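For readers unfamiliar with the metric, character error rate (CER) is conventionally the character-level edit distance between the recognized text and the reference transcript, summed over all utterances and divided by the total number of reference characters. The sketch below illustrates that conventional definition only; the function names are hypothetical, and the scoring procedure actually used in the paper is not shown on this page.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1       # character of ref missing from hyp
            insertion = curr[j - 1] + 1  # extra character in hyp
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def character_error_rate(references, hypotheses):
    """Total edit distance divided by total reference length (hypothetical helper)."""
    total_errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references must contain at least one character")
    return total_errors / total_chars
```

Under this definition, an 11.34% CER corresponds to roughly 11 character edits (substitutions, deletions, or insertions) per 100 reference characters.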

Acknowledgments

This work was supported by the National Key R&D Program of China (2016YFC0801800); General Program of the National Natural Science Foundation of China (Grant No. 61772353); the Key Program of the National Natural Science Foundation of China (Grant No. 61332002); and Fok Ying Tung Education Foundation (151068).

Author information

Corresponding author

Correspondence to Lei Zhang.

Additional information

Wangyang Ying received his BS degree in computer science from Sichuan University, China in 2016. He is currently working toward an MS degree at the Machine Intelligence Laboratory, College of Computer Science, Sichuan University, China. His current research interests include neural networks, deep learning, and speech recognition.

Lei Zhang received the BS and MS degrees in mathematics and the PhD degree in computer science from the University of Electronic Science and Technology of China, China in 2002, 2005, and 2008, respectively. She was a post-doctoral research fellow with the Department of Computer Science and Engineering, Chinese University of Hong Kong, China from 2008 to 2009. She is currently a professor with Sichuan University, China. Her current research interests include theory and applications of neural networks based on neocortex computing and big data analysis methods by infinity deep neural networks.

Hongli Deng received her MS degree from the University of Science and Technology of China, China in 2008. She is currently working toward a PhD degree at the Machine Intelligence Laboratory, College of Computer Science, Sichuan University, China. Her current research interests include neural networks, deep learning, and natural language processing.

Cite this article

Ying, W., Zhang, L. & Deng, H. Sichuan dialect speech recognition with deep LSTM network. Front. Comput. Sci. 14, 378–387 (2020). https://doi.org/10.1007/s11704-018-8030-z

  • DOI: https://doi.org/10.1007/s11704-018-8030-z
