
SETransformer: Speech Enhancement Transformer


Abstract

Speech enhancement is a fundamental way to improve speech perception quality in adverse environments where the received speech is seriously corrupted by noise. In this paper, we propose a cognitive computing based speech enhancement model, termed SETransformer, which can improve speech quality in unknown noisy environments. The proposed SETransformer takes advantage of LSTM and the multi-head attention mechanism, both of which are inspired by the auditory perception principles of human beings. Specifically, the SETransformer possesses the ability to characterize the local structure implicit in the speech spectrum and has lower computational complexity owing to its distinctive parallelization performance. Experimental results show that, compared with the standard Transformer and the LSTM model, the proposed SETransformer consistently achieves better denoising performance in terms of speech quality (PESQ) and speech intelligibility (STOI) under unseen noise conditions.
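
To make the architectural idea concrete, below is a minimal, hypothetical PyTorch sketch of a block that combines an LSTM with multi-head self-attention to predict a magnitude mask for a noisy speech spectrum. The module name, layer sizes, and masking strategy are illustrative assumptions and do not reproduce the exact SETransformer configuration described in the paper.

# Hypothetical sketch: LSTM + multi-head attention for spectral masking.
# Layer sizes and the masking strategy are assumptions, not the authors' design.
import torch
import torch.nn as nn

class LstmAttentionEnhancer(nn.Module):
    def __init__(self, n_freq_bins=257, hidden=256, n_heads=4):
        super().__init__()
        # Bidirectional LSTM captures local temporal structure of the spectrogram.
        self.lstm = nn.LSTM(n_freq_bins, hidden, batch_first=True, bidirectional=True)
        # Multi-head self-attention models longer-range dependencies in parallel.
        self.attn = nn.MultiheadAttention(2 * hidden, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(2 * hidden)
        # Predict a per-bin ratio mask applied to the noisy magnitude spectrum.
        self.mask = nn.Sequential(nn.Linear(2 * hidden, n_freq_bins), nn.Sigmoid())

    def forward(self, noisy_mag):
        # noisy_mag: (batch, frames, freq_bins) magnitude spectrogram
        h, _ = self.lstm(noisy_mag)
        a, _ = self.attn(h, h, h)
        h = self.norm(h + a)               # residual connection around attention
        return self.mask(h) * noisy_mag    # enhanced (masked) magnitude

if __name__ == "__main__":
    x = torch.rand(2, 100, 257)            # 2 utterances, 100 frames, 257 bins
    print(LstmAttentionEnhancer()(x).shape)  # torch.Size([2, 100, 257])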



Funding

This study was funded in part by the Natural Science Fund Project of China (No. 61301295), the Anhui Natural Science Fund Project (No. 1708085MF151), and the Anhui University Natural Science Research Project (No. KJ2018A0018).

Author information


Corresponding author

Correspondence to Jian Zhou.

Ethics declarations

Conflicts of Interest

The authors declare that they have no conflict of interest.

Ethical Approval

This article does not contain any studies with human participants or animals performed by any of the authors.


About this article


Cite this article

Yu, W., Zhou, J., Wang, H. et al. SETransformer: Speech Enhancement Transformer. Cogn Comput 14, 1152–1158 (2022). https://doi.org/10.1007/s12559-020-09817-2

