
English speech recognition based on deep learning with multiple features

Published in: Computing (Springer)

Abstract

English is one of the most widely used languages in the world. As the global village shrinks, smart-home devices, in-vehicle voice systems, and speech recognition software that use English as the recognition language have gradually entered everyday life and won over users through their practical accuracy. At the same time, deep learning, with its hierarchical feature learning and data modeling capabilities, has surpassed shallow learning techniques on many tasks. This paper therefore takes English speech as its research object and proposes a deep learning speech recognition algorithm that combines speech features with speech attributes. First, a deep neural network trained with supervised learning extracts high-level features of the speech; the output of a fixed hidden layer is taken as a new speech feature, and a GMM–HMM acoustic model is trained on these new features. Second, deep neural network speech attribute extractors are trained for multiple speech attributes, and the extracted attributes are classified into phonemes by a further deep neural network. Finally, the speech features and the speech attribute features are merged into the same CNN framework by a neural network based on a linear feature fusion algorithm. Experimental results show that the proposed multi-feature deep neural network algorithm, by combining the speaker's speech features and speech attributes at the input layer of the network, integrates the two methods directly and effectively, and significantly improves the performance of the English speech recognition system.
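The pipeline described in the abstract, where the activations of a fixed hidden layer are reused as new speech features and then fused frame by frame with speech-attribute posteriors before a downstream acoustic model, can be sketched roughly as follows. All shapes, weights, and the 39/32/10-dimensional layouts are illustrative assumptions for a minimal sketch, not the paper's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Frame-level acoustic input: 39-dim MFCC-style vectors for 100 frames
# (placeholder data; a real system would use extracted speech features).
frames = rng.standard_normal((100, 39))

def relu(x):
    return np.maximum(x, 0.0)

# A small feed-forward "feature extractor": stand-in weights for a DNN
# trained with supervised learning (random placeholders here).
W1 = rng.standard_normal((39, 64)) * 0.1
W2 = rng.standard_normal((64, 32)) * 0.1   # fixed hidden ("bottleneck") layer
W3 = rng.standard_normal((32, 48)) * 0.1   # phoneme output layer (unused below)

def bottleneck_features(x):
    """Forward only up to the fixed hidden layer and return its
    activations as the new speech features (tandem-style approach)."""
    h1 = relu(x @ W1)
    return relu(h1 @ W2)          # 32-dim feature vector per frame

speech_feats = bottleneck_features(frames)          # shape (100, 32)

# A second DNN (placeholder) would yield speech-attribute posteriors,
# e.g. manner/place-of-articulation classes; 10 attribute dims assumed.
attr_feats = rng.random((100, 10))

# Linear feature fusion: concatenate the two streams frame by frame,
# giving one input matrix for the downstream CNN acoustic model.
fused = np.concatenate([speech_feats, attr_feats], axis=1)
print(fused.shape)   # → (100, 42)
```

In a real system the GMM–HMM acoustic model would be trained on `speech_feats`, while `fused` would feed the CNN; here both networks are random stand-ins to show only the data flow.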



Author information


Corresponding author

Correspondence to Zhaojuan Song.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article

Cite this article

Song, Z. English speech recognition based on deep learning with multiple features. Computing 102, 663–682 (2020). https://doi.org/10.1007/s00607-019-00753-0

