
Convolutional Neural Network-Gated Recurrent Unit Neural Network with Feature Fusion for Environmental Sound Classification

Published in: Automatic Control and Computer Sciences

Abstract

With the broad success of deep learning in classification problems, a growing number of researchers have applied deep models to environmental sound classification (ESC) in recent years. However, existing models that train deep neural networks on acoustic features such as the log-scaled mel spectrogram (Log mel) and mel-frequency cepstral coefficients, or on raw waveforms, still perform unsatisfactorily on ESC. This paper first proposes a fusion of multiple features: the Log mel, the log-scaled cochleagram, and the log-scaled constant-Q transform are combined into a feature set called LMCC. It then presents CNN-GRUNN, a network that combines a convolutional neural network and a gated recurrent unit neural network in parallel, to improve ESC performance with the aggregated features. Experiments were conducted on the ESC-10, ESC-50, and UrbanSound8K datasets. The results indicate that feeding LMCC into CNN-GRUNN is well suited to ESC problems and that the model achieves good classification accuracy on all three datasets: ESC-10 (92.30%), ESC-50 (87.43%), and UrbanSound8K (96.10%).
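For concreteness, here is a minimal sketch of how an LMCC-style feature stack could be assembled in Python. The abstract does not give implementation details, so the choice of librosa, the third-party gammatone package as a stand-in cochleagram front end, and all band, window, and hop settings below are assumptions for illustration only:

```python
import numpy as np
import librosa
from gammatone.gtgram import gtgram  # gammatone filterbank as a cochleagram proxy

def lmcc(path, sr=22050, n_bands=64, hop=512):
    """Stack Log mel, log cochleagram, and log CQT as three channels."""
    y, _ = librosa.load(path, sr=sr)

    # Log-scaled mel spectrogram (Log mel).
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_bands, hop_length=hop)
    log_mel = librosa.power_to_db(mel)

    # Log-scaled cochleagram, approximated here with a gammatone filterbank;
    # the paper's exact front end may differ.
    coch = gtgram(y, sr, window_time=0.025, hop_time=hop / sr,
                  channels=n_bands, f_min=50)
    log_coch = np.log(coch + 1e-8)

    # Log-scaled constant-Q transform.
    cqt = np.abs(librosa.cqt(y, sr=sr, n_bins=n_bands, hop_length=hop))
    log_cqt = librosa.amplitude_to_db(cqt)

    # Crop to a common frame count and stack into one 3-channel input.
    t = min(log_mel.shape[1], log_coch.shape[1], log_cqt.shape[1])
    return np.stack([log_mel[:, :t], log_coch[:, :t], log_cqt[:, :t]], axis=-1)
```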
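Likewise, a minimal Keras sketch of a parallel CNN + GRU classifier in the spirit of CNN-GRUNN; the published model's layer counts, widths, and merge strategy are not specified in the abstract, so every hyperparameter here is illustrative:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn_grunn(n_bands=64, n_frames=128, n_channels=3, n_classes=50):
    inp = layers.Input(shape=(n_bands, n_frames, n_channels))

    # CNN branch: local time-frequency patterns.
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(inp)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    x = layers.GlobalAveragePooling2D()(x)

    # GRU branch: temporal dynamics, with time moved to the first axis.
    r = layers.Permute((2, 1, 3))(inp)                  # (frames, bands, channels)
    r = layers.Reshape((n_frames, n_bands * n_channels))(r)
    r = layers.GRU(128)(r)

    # Merge the parallel branches and classify.
    h = layers.Concatenate()([x, r])
    out = layers.Dense(n_classes, activation="softmax")(h)
    return models.Model(inp, out)

model = build_cnn_grunn()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

The rationale for the parallel layout is that the CNN branch summarizes local spectro-temporal patterns while the GRU branch models longer-range temporal structure; concatenating the two gives the classifier both views of the same LMCC input.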



Author information

Correspondence to Jinfang Zeng.

Ethics declarations

The authors declare no conflict of interest.

About this article


Cite this article

Zhang, Y., Zeng, J., Li, Y., et al., Convolutional Neural Network-Gated Recurrent Unit Neural Network with Feature Fusion for Environmental Sound Classification, Aut. Control Comp. Sci., 2021, vol. 55, pp. 311–318. https://doi.org/10.3103/S0146411621040106
