
Convolutional Neural Network-Gated Recurrent Unit Neural Network with Feature Fusion for Environmental Sound Classification

Published in: Automatic Control and Computer Sciences

Abstract

With the broad success of deep learning in classification problems, a growing number of researchers have applied deep models to environmental sound classification (ESC) in recent years. However, existing models that train deep neural networks on acoustic features such as the log-scaled mel spectrogram (Log mel) and mel-frequency cepstral coefficients, or on raw waveforms, still perform unsatisfactorily on ESC. This paper first proposes a fusion of multiple features: the Log mel, the log-scaled cochleagram, and the log-scaled constant-Q transform are combined into a feature set called LMCC. It then presents CNN-GRUNN, a network that combines a convolutional neural network and a gated recurrent unit neural network in parallel, to improve ESC performance with the aggregated features. Experiments were conducted on the ESC-10, ESC-50, and UrbanSound8K datasets. The results indicate that feeding LMCC into CNN-GRUNN is well suited to ESC problems and that the model achieves good classification accuracy on all three datasets: ESC-10 (92.30%), ESC-50 (87.43%), and UrbanSound8K (96.10%).
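For concreteness, here is a minimal sketch of how an LMCC-style feature stack could be assembled in Python. The abstract does not give implementation details, so the choice of librosa, the third-party gammatone package as a stand-in cochleagram front end, and all band, window, and hop settings below are assumptions for illustration only:

```python
import numpy as np
import librosa
from gammatone.gtgram import gtgram  # gammatone filterbank as a cochleagram proxy

def lmcc(path, sr=22050, n_bands=64, hop=512):
    """Stack Log mel, log cochleagram, and log CQT as three channels."""
    y, _ = librosa.load(path, sr=sr)

    # Log-scaled mel spectrogram (Log mel).
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_bands, hop_length=hop)
    log_mel = librosa.power_to_db(mel)

    # Log-scaled cochleagram, approximated here with a gammatone filterbank;
    # the paper's exact front end may differ.
    coch = gtgram(y, sr, window_time=0.025, hop_time=hop / sr,
                  channels=n_bands, f_min=50)
    log_coch = np.log(coch + 1e-8)

    # Log-scaled constant-Q transform.
    cqt = np.abs(librosa.cqt(y, sr=sr, n_bins=n_bands, hop_length=hop))
    log_cqt = librosa.amplitude_to_db(cqt)

    # Crop to a common frame count and stack into one 3-channel input.
    t = min(log_mel.shape[1], log_coch.shape[1], log_cqt.shape[1])
    return np.stack([log_mel[:, :t], log_coch[:, :t], log_cqt[:, :t]], axis=-1)
```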
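Likewise, a minimal Keras sketch of a parallel CNN + GRU classifier in the spirit of CNN-GRUNN; the published model's layer counts, widths, and merge strategy are not specified in the abstract, so every hyperparameter here is illustrative:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn_grunn(n_bands=64, n_frames=128, n_channels=3, n_classes=50):
    inp = layers.Input(shape=(n_bands, n_frames, n_channels))

    # CNN branch: local time-frequency patterns.
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(inp)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    x = layers.GlobalAveragePooling2D()(x)

    # GRU branch: temporal dynamics, with time moved to the first axis.
    r = layers.Permute((2, 1, 3))(inp)                  # (frames, bands, channels)
    r = layers.Reshape((n_frames, n_bands * n_channels))(r)
    r = layers.GRU(128)(r)

    # Merge the parallel branches and classify.
    h = layers.Concatenate()([x, r])
    out = layers.Dense(n_classes, activation="softmax")(h)
    return models.Model(inp, out)

model = build_cnn_grunn()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

The rationale for the parallel layout is that the CNN branch summarizes local spectro-temporal patterns while the GRU branch models longer-range temporal structure; concatenating the two gives the classifier both views of the same LMCC input.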



Author information

Correspondence to Jinfang Zeng.

Ethics declarations

The authors declare no conflict of interest.

About this article


Cite this article

Zhang, Y., Zeng, J., Li, Y., et al., Convolutional Neural Network-Gated Recurrent Unit Neural Network with Feature Fusion for Environmental Sound Classification, Aut. Control Comp. Sci., 2021, vol. 55, pp. 311–318. https://doi.org/10.3103/S0146411621040106
