
Compute and Memory Efficient Universal Sound Source Separation

Published in: Journal of Signal Processing Systems

Abstract

Recent progress in audio source separation led by deep learning has enabled many neural network models to provide robust solutions to this fundamental estimation problem. In this study, we provide a family of efficient neural network architectures for general-purpose audio source separation while focusing on multiple computational aspects that hinder the application of neural networks in real-world scenarios. The backbone structure of these convolutional networks is the SUccessive DOwnsampling and Resampling of Multi-Resolution Features (SuDoRM-RF), combined with their aggregation, which is performed through simple one-dimensional convolutions. This mechanism enables our models to obtain high-fidelity signal separation in a wide variety of settings where a variable number of sources are present, using limited computational resources (e.g., floating-point operations, memory footprint, number of parameters, and latency). Our experiments show that SuDoRM-RF models perform comparably to, and in some cases surpass, several state-of-the-art benchmarks that require significantly more computational resources. The causal variant of SuDoRM-RF achieves competitive real-time speech separation performance of around 10 dB scale-invariant signal-to-distortion ratio improvement (SI-SDRi) while remaining up to 20 times faster than real time on a laptop device.
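
To make the core mechanism concrete, the following is a minimal PyTorch sketch of one multi-resolution block: features are successively downsampled with strided depthwise convolutions, resampled back to the input resolution, and aggregated with a simple one-dimensional (pointwise) convolution. This is an illustrative approximation under assumed layer sizes, not the authors' implementation (the official code is at https://github.com/etzinis/sudo_rm_rf); the module name, channel count, kernel sizes, and number of scales are hypothetical.

# A minimal sketch (assumed, not the authors' code) of the SuDoRM-RF idea:
# successive downsampling to coarser temporal resolutions, resampling back,
# and aggregation of the multi-resolution features with a 1-D convolution.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiResolutionBlock(nn.Module):
    def __init__(self, channels=128, num_scales=4):
        super().__init__()
        # Successive downsampling: each depthwise conv halves the time axis.
        self.down = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size=5, stride=2,
                      padding=2, groups=channels)
            for _ in range(num_scales)
        ])
        # Aggregation of the resampled features with a pointwise convolution.
        self.aggregate = nn.Conv1d(channels * (num_scales + 1), channels,
                                   kernel_size=1)

    def forward(self, x):
        # x: (batch, channels, time) of encoded audio features
        feats, h = [x], x
        for down in self.down:
            h = F.relu(down(h))          # move to a coarser resolution
            feats.append(h)
        # Resample every scale back to the input resolution and concatenate.
        resampled = [F.interpolate(f, size=x.shape[-1], mode='nearest')
                     for f in feats]
        return x + self.aggregate(torch.cat(resampled, dim=1))  # residual

if __name__ == "__main__":
    block = MultiResolutionBlock()
    y = block(torch.randn(1, 128, 4000))  # 4000 frames of encoded audio
    print(y.shape)                        # torch.Size([1, 128, 4000])

Depthwise (grouped) convolutions and parameter-free nearest-neighbor resampling are the kinds of choices that keep floating-point operations, parameters, and memory low, in line with the efficiency goals stated in the abstract; the exact block composition, normalization layers, and encoder/decoder of SuDoRM-RF differ and are detailed in the paper.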

Notes

  1. Code: https://github.com/etzinis/sudo_rm_rf

Author information

Corresponding author

Correspondence to Efthymios Tzinis.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Tzinis, E., Wang, Z., Jiang, X. et al. Compute and Memory Efficient Universal Sound Source Separation. J Sign Process Syst 94, 245–259 (2022). https://doi.org/10.1007/s11265-021-01683-x
