Environmental sound classification using a regularized deep convolutional neural network with data augmentation,Applied Acoustics



Applied Acoustics (IF 3.4), Pub Date: 2020-10-01, DOI: 10.1016/j.apacoust.2020.107389
Zohaib Mushtaq, Shun-Feng Su

Abstract Environmental sound classification (ESC) has been adopted rapidly in recent years because of its broad range of applications in daily life. ESC, also known as Sound Event Recognition (SER), involves recognizing the context of an audio stream that contains various environmental sounds. Several common factors make the ESC problem complex: non-uniform distance between the acoustic source and the microphone, differences in the framework, the presence of numerous sound sources in a recording, and overlapping sound events. This study employs deep convolutional neural networks (DCNN) with regularization and data augmentation, together with basic audio features that have been verified to be effective on ESC tasks. The performance of a DCNN with max-pooling (Model-1) and without max-pooling (Model-2) is examined. Three audio feature extraction techniques, Mel spectrogram (Mel), Mel Frequency Cepstral Coefficient (MFCC), and Log-Mel, are considered for the ESC-10, ESC-50, and Urban Sound (US8K) datasets. Furthermore, to reduce the risk of overfitting caused by the limited amount of data, the study applies offline data augmentation to enlarge the datasets, in combination with L2 regularization. The evaluation shows that the best accuracy is attained by the proposed DCNN without max-pooling (Model-2) using Log-Mel features on the augmented datasets. For ESC-10, ESC-50, and US8K, the highest accuracies achieved are 94.94%, 89.28%, and 95.37%, respectively. The experimental results show that the proposed approach can achieve the best performance on environmental sound classification problems.
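To make the Log-Mel feature named in the abstract concrete, the following is a minimal pure-Python sketch of how one frame of a power spectrum can be mapped to log-Mel energies. It is not the authors' implementation: the HTK-style Mel formula, the triangular filter shape, the dB scaling, and every function and parameter name (`mel_filterbank`, `log_mel_frame`, `n_mels`, `n_fft`, `eps`) are assumptions chosen for illustration, and the abstract does not state the settings used in the paper.

```python
import math

def hz_to_mel(hz):
    """Hz -> mel using the HTK-style formula (one common convention, assumed here)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Build n_mels triangular filters over the n_fft // 2 + 1 FFT bins."""
    n_bins = n_fft // 2 + 1
    # n_mels + 2 edge frequencies, equally spaced on the mel scale up to Nyquist.
    top = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(top * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_bins)]
    bank = []
    for m in range(1, n_mels + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for f in bin_hz:
            rising = (f - lo) / (mid - lo)    # slope from lo up to the peak
            falling = (hi - f) / (hi - mid)   # slope from the peak down to hi
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank

def log_mel_frame(power_spectrum, bank, eps=1e-10):
    """Apply the filter bank to one power-spectrum frame and log-compress (dB)."""
    energies = [sum(w * p for w, p in zip(row, power_spectrum)) for row in bank]
    # eps floors the energy so that an empty filter does not produce log(0).
    return [10.0 * math.log10(max(eps, e)) for e in energies]

# Hypothetical usage: a flat power spectrum for a 64-point FFT at 16 kHz.
bank = mel_filterbank(n_mels=8, n_fft=64, sample_rate=16000)
features = log_mel_frame([1.0] * 33, bank)
```

Stacking such frames over time gives the two-dimensional Log-Mel representation that a convolutional network can take as input. Dropping the logarithm yields a plain Mel spectrogram, and MFCCs are conventionally obtained by applying a discrete cosine transform to the log-Mel energies.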


Updated: 2020-10-01
