Abstract
Much recent progress in monaural speech separation (MSS) has been driven by deep learning architectures based on autoencoders, which use an encoder to condense the input signal into compressed features and then feed these features into a decoder to reconstruct a specific audio source of interest. However, these approaches neither learn the generative factors of the original input nor reconstruct every audio source in the mixed speech. In this study, we propose a novel weighted-factor autoencoder (WFAE) model for MSS, which introduces a regularization loss into the objective function so that each constructed source is isolated from the others. By incorporating a latent attention mechanism and a supervised source constructor in the separation layer, WFAE learns source-specific generative factors and a set of discriminative features for each source, improving separation performance. Experiments on benchmark datasets show that our approach outperforms existing methods: on three important metrics, WFAE achieves substantial gains on the more challenging speaker-independent MSS task.
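The abstract names the model's components (an autoencoder backbone, a latent attention mechanism, a supervised per-source constructor, and a leakage regularizer) without giving equations, so the following is a minimal PyTorch sketch of how such a model could be wired together. The class and function names (WFAESketch, wfae_loss), the layer sizes, the cosine-similarity form of the cross-source regularizer, and the weight lam are illustrative assumptions, not the paper's actual design.

    # Illustrative sketch only: every module and hyperparameter below is an
    # assumption; the paper's exact WFAE layers and losses are not in the abstract.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class WFAESketch(nn.Module):
        """Weighted-factor autoencoder, sketched for two speakers."""
        def __init__(self, n_feats=129, n_latent=64, n_sources=2):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Linear(n_feats, 256), nn.ReLU(),
                nn.Linear(256, n_latent),
            )
            # Latent attention: one weight vector per source, re-weighting the
            # shared latent code into source-specific generative factors.
            self.attention = nn.Parameter(torch.randn(n_sources, n_latent))
            # Supervised source constructor: one decoder per source.
            self.decoders = nn.ModuleList(
                nn.Sequential(
                    nn.Linear(n_latent, 256), nn.ReLU(),
                    nn.Linear(256, n_feats),
                )
                for _ in range(n_sources)
            )

        def forward(self, mixture):                      # mixture: (batch, n_feats)
            z = self.encoder(mixture)                    # shared latent code
            weights = torch.softmax(self.attention, dim=-1)
            factors = z.unsqueeze(1) * weights           # (batch, n_sources, n_latent)
            return torch.stack(
                [dec(factors[:, i]) for i, dec in enumerate(self.decoders)], dim=1
            )                                            # (batch, n_sources, n_feats)

    def wfae_loss(estimates, targets, lam=0.1):
        """Reconstruction loss plus a regularizer penalizing cross-source leakage."""
        recon = F.mse_loss(estimates, targets)
        # For two sources, flipping the source axis pairs each estimate with the
        # other speaker's reference; penalizing their (bounded) cosine similarity
        # discourages one output from containing the other source.
        cross = F.cosine_similarity(estimates, targets.flip(1), dim=-1).mean()
        return recon + lam * cross

A hypothetical training step under the same assumptions: with a batch of mixture magnitude frames of shape (8, 129) and clean references of shape (8, 2, 129), loss = wfae_loss(WFAESketch()(mixture), targets) followed by loss.backward() optimizes reconstruction while pushing the two constructed sources apart.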
Additional information
Project supported by the Key Project of the National Natural Science Foundation of China (No. U1836220), the National Natural Science Foundation of China (No. 61672267), the Qing Lan Talent Program of Jiangsu Province, China, and the Key Innovation Project of Undergraduate Students in Jiangsu Province, China (No. 201810299045Z).
Contributors
Jing-jing CHEN and Qi-rong MAO designed the research. Jing-jing CHEN processed the data. Jing-jing CHEN and Qi-rong MAO drafted the manuscript. You-cai QIN, Shuang-qing QIAN, and Zhi-shen ZHENG helped organize the manuscript. Jing-jing CHEN and Qi-rong MAO revised and finalized the paper.
Compliance with ethics guidelines
Jing-jing CHEN, Qi-rong MAO, You-cai QIN, Shuang-qing QIAN, and Zhi-shen ZHENG declare that they have no conflict of interest.
Cite this article
Chen, Jj., Mao, Qr., Qin, Yc. et al. Latent source-specific generative factor learning for monaural speech separation using weighted-factor autoencoder. Front Inform Technol Electron Eng 21, 1639–1650 (2020). https://doi.org/10.1631/FITEE.2000019