
SETransformer: Speech Enhancement Transformer


Abstract

Speech enhancement is a fundamental way to improve speech perception quality in adverse environments where the received speech is seriously corrupted by noise. In this paper, we propose a cognitive computing based speech enhancement model, termed SETransformer, which can improve speech quality in unknown noisy environments. The proposed SETransformer takes advantage of LSTM and the multi-head attention mechanism, both of which are inspired by the auditory perception principles of human beings. Specifically, the SETransformer possesses the ability to characterize the local structure implicit in the speech spectrum and has lower computational complexity owing to its distinctive parallelization performance. Experimental results show that, compared with the standard Transformer and the LSTM model, the proposed SETransformer consistently achieves better denoising performance in terms of speech quality (PESQ) and speech intelligibility (STOI) under unseen noise conditions.
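
To make the architectural idea concrete, below is a minimal, hypothetical PyTorch sketch of a block that combines an LSTM with multi-head self-attention to predict a magnitude mask for a noisy speech spectrum. The module name, layer sizes, and masking strategy are illustrative assumptions and do not reproduce the exact SETransformer configuration described in the paper.

# Hypothetical sketch: LSTM + multi-head attention for spectral masking.
# Layer sizes and the masking strategy are assumptions, not the authors' design.
import torch
import torch.nn as nn

class LstmAttentionEnhancer(nn.Module):
    def __init__(self, n_freq_bins=257, hidden=256, n_heads=4):
        super().__init__()
        # Bidirectional LSTM captures local temporal structure of the spectrogram.
        self.lstm = nn.LSTM(n_freq_bins, hidden, batch_first=True, bidirectional=True)
        # Multi-head self-attention models longer-range dependencies in parallel.
        self.attn = nn.MultiheadAttention(2 * hidden, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(2 * hidden)
        # Predict a per-bin ratio mask applied to the noisy magnitude spectrum.
        self.mask = nn.Sequential(nn.Linear(2 * hidden, n_freq_bins), nn.Sigmoid())

    def forward(self, noisy_mag):
        # noisy_mag: (batch, frames, freq_bins) magnitude spectrogram
        h, _ = self.lstm(noisy_mag)
        a, _ = self.attn(h, h, h)
        h = self.norm(h + a)               # residual connection around attention
        return self.mask(h) * noisy_mag    # enhanced (masked) magnitude

if __name__ == "__main__":
    x = torch.rand(2, 100, 257)            # 2 utterances, 100 frames, 257 bins
    print(LstmAttentionEnhancer()(x).shape)  # torch.Size([2, 100, 257])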



Funding

This study was funded in part by the Natural Science Fund Project of China (No. 61301295), the Anhui Natural Science Fund Project (No. 1708085MF151), and the Anhui University Natural Science Research Project (No. KJ2018A0018).

Author information


Corresponding author

Correspondence to Jian Zhou.

Ethics declarations

Conflicts of Interest

The authors declare that they have no conflict of interest.

Ethical Approval

This article does not contain any studies with human participants or animals performed by any of the authors.


About this article


Cite this article

Yu, W., Zhou, J., Wang, H. et al. SETransformer: Speech Enhancement Transformer. Cogn Comput 14, 1152–1158 (2022). https://doi.org/10.1007/s12559-020-09817-2

