Data augmentation approaches for improving animal audio classification

https://doi.org/10.1016/j.ecoinf.2020.101084

Highlights

  • We apply convolutional neural networks for classifying animal audio.

  • We test many data augmentation techniques applied on raw audio and on spectrograms.

  • Our ensemble improves the state of the art on two animal audio datasets.

Abstract

In this paper we present ensembles of classifiers for automated animal audio classification, exploiting different data augmentation techniques for training Convolutional Neural Networks (CNNs). The specific animal audio classification problems are i) bird and ii) cat sounds, whose datasets are freely available. We train five different CNNs on the original datasets and on their versions augmented by four augmentation protocols, working on the raw audio signals or on their representations as spectrograms. We compare our best approaches with the state of the art, showing that we obtain the best recognition rate on the same datasets, without ad hoc parameter optimization. Our study shows that different CNNs can be trained for the purpose of animal audio classification and that their fusion works better than the stand-alone classifiers. To the best of our knowledge, this is the largest study on data augmentation for CNNs applied to animal audio datasets using the same set of classifiers and parameters. Our MATLAB code is available at https://github.com/LorisNanni.

Introduction

In the current context of constantly increasing environmental awareness, highly accurate sound recognition systems can play a pivotal role in mitigating or managing threats such as the increasing risk of animal species loss or climate change affecting wildlife (Zhao et al., 2017). Sound classification and recognition have been included among the pattern recognition tasks in different application domains, e.g. speech recognition (Padmanabhan and Johnson Premkumar, 2015), music classification (Nanni et al., 2017), environmental sound recognition and biometric identification (Sahoo et al., 2012). In the traditional pattern recognition framework (preprocessing, feature extraction and classification), features have generally been extracted from the actual audio traces (e.g. the Statistical Spectrum Descriptor or the Rhythm Histogram (Lidy and Rauber, 2005)). However, the conversion of audio traces into visual representations enabled the use of feature extraction techniques commonly used for image classification. The most common visual representation of an audio trace displays its spectrum of frequencies as it varies with time, e.g. spectrograms (Wyse, 2017), Mel-frequency Cepstral Coefficient spectrograms (Rubin et al., 2016) and other representations derived from these. A spectrogram can be described as a bidimensional graph with two geometric dimensions (time and frequency) plus a third dimension encoding the signal amplitude at a specific frequency and time step as pixel intensity (Nanni et al., 2014a). For example, Costa et al. (2011, 2012) applied many texture analysis and classification techniques to music genre classification. In (Costa et al., 2011) grey level co-occurrence matrices (GLCMs) (Haralick et al., 1979) were computed on spectrograms as features to train support vector machines (SVMs) on the Latin Music Database (LMD) (Silla Jr et al., 2008). Similarly, in Costa et al. (2012) they used one of the most famous texture descriptors, the local binary pattern (LBP) (Ojala et al., 2002), again to train SVMs on the LMD and ISMIR04 (Cano et al., 2006) datasets, improving the classification accuracy with respect to their previous work. In 2013 (Costa et al., 2013) they followed the same approach, but used local phase quantization (LPQ) and Gabor filters (Ojansivu and Heikkilä, 2008) for feature extraction. This marked an interesting parallel between the development of increasingly refined texture descriptors for image classification and their application to sound recognition. In 2017, Nanni et al. (2017) presented the fusion of state-of-the-art texture descriptors with acoustic features extracted from the audio traces on multiple datasets, demonstrating how such fusion greatly improves the accuracy of a system based only on acoustic or visual features. However, with the diffusion of deep learning and the availability of increasingly powerful Graphics Processing Units (GPUs) at accessible costs, i) the canonical pattern recognition framework changed and ii) attention became focused on visual representations of acoustic traces. The optimization of the feature extraction step had a key role in the canonical framework, especially through the development of handcrafted features that place patterns from the same class closer to each other in the feature space while maximizing their distance from other classes.
Since deep classifiers learn the best features for describing patterns during the training process, the aforementioned feature engineering has lost part of its importance and has been coupled with the direct use of visual representations of audio traces, letting the classifiers select the most informative features. Another reason for representing the patterns as images at the beginning of the pipeline is the intrinsic architecture of the most famous deep classifiers, such as convolutional neural networks (CNNs), which require images as their input. This motivated researchers using CNNs in audio classification to advance methods for converting audio signals into time-frequency images.
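As a concrete illustration of the hand-crafted pipeline described above (spectrogram, texture descriptor, SVM), the following minimal MATLAB sketch converts audio files into log-magnitude spectrogram images, extracts LBP features and trains a multiclass SVM. It assumes the Signal Processing, Image Processing, Computer Vision and Statistics toolboxes; the variables wavFiles and labels are hypothetical placeholders, not part of the paper's code.

```matlab
% Minimal sketch: spectrogram -> LBP texture features -> multiclass SVM.
% wavFiles (cell array of file names) and labels are hypothetical placeholders.
feats = [];
for k = 1:numel(wavFiles)
    [x, fs] = audioread(wavFiles{k});                   % raw audio signal
    s = spectrogram(x(:, 1), hann(512), 256, 512, fs);  % complex STFT coefficients
    img = mat2gray(log(abs(s) + eps));                  % log-magnitude as a grey image
    feats(k, :) = extractLBPFeatures(img);              % LBP texture descriptor
end
svmModel = fitcecoc(feats, labels);                     % one-vs-one SVM ensemble
```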

Among the first studies using deep learning on audio images, Humphrey and Bello (Humphrey et al., 2012; Humphrey and Bello, 2012) explored CNNs as an alternative approach to music classification problems, defining the state of the art in automatic chord detection and recognition. Nakashika et al. (2012) performed music genre classification on the GTZAN dataset (Tzanetakis and Cook, 2002) by converting spectrograms into GLCM maps to train CNNs. Costa et al. (2017) fused canonical approaches, e.g. LBP-trained SVMs, with CNNs, performing better than the state of the art on the LMD dataset.

In addition to approaches derived directly from image classification, a few studies focused on different aspects of the classification process in order to make it more specific to sound recognition. Sigtia and Dixon (2014) aimed to adjust CNN parameters and structures, and showed how the training time was reduced by replacing sigmoid units with Rectified Linear Units (ReLU) and stochastic gradient descent with Hessian-Free optimization. Wang et al. (2017) proposed a novel CNN called the sparse coding CNN for sound event recognition and retrieval, obtaining competitive and sometimes better results than most of the other approaches when evaluating the performance under noisy and clean conditions. Another hybrid approach by Oramas et al. (2017) combined different modalities (album cover images, reviews and audio tracks) for multi-label music genre classification, using deep learning methods appropriate for each modality and outperforming the unimodal approaches.

The clear improvement in classification performance introduced by deep classifiers led to the application of sound recognition to further tasks, such as biodiversity assessment or the monitoring of animal species at risk. For example, birds have been acknowledged as biological indicators for ecological research. Therefore, their observation and monitoring are increasingly important for biodiversity conservation, with the additional advantage that the acquisition of video and audio information is minimally invasive. To date, many datasets are available to develop classifiers that identify and monitor different species such as birds (Acevedo et al., 2009; Cullinan et al., 2015), whales (Fristrup and Watkins, 1993), frogs (Acevedo et al., 2009), bats (Cullinan et al., 2015) and cats (Pandeya et al., 2018). For instance, Cao et al. (2015) combined a CNN with handcrafted features to classify marine animals (Nanni et al., 2014b) (the Fish and MBARI benthic animal dataset (Edgington et al., 2006)). Salamon et al. (2017) investigated the fusion of deep learning (using a CNN) and shallow learning for the problem of bird species identification, based on 5428 bird flight calls from 43 species. In both these works, the fusion of CNNs with more canonical techniques outperformed the single approaches.

One of the main drawbacks of deep learning approaches is the need for a large amount of training data (Marcus, 2018), in this case audio signals and, consequently, their visual representations. When the amount of training data is limited, data augmentation is a powerful tool. Animal sound datasets are usually much smaller than necessary, since sample collection and labelling can be very expensive. Commonly, audio signals can be augmented in the time and/or frequency domains, either directly on the raw signals or after their conversion into spectrograms. In Lasseck (2018) different augmentation techniques were applied to the training set for the BirdCLEF 2018 initiative (www.imageclef.org/node/230), which included over 30,000 bird sound samples covering 1500 species. Bird audio signals were first augmented in the time domain, e.g. by extracting chunks from random positions in each file, applying jitter to the duration, adding audio chunks from random files as background noise together with atmospheric noise, and applying random cyclic shifts and time interval dropout. Every augmented audio chunk was then converted into a spectrogram and further augmented in the frequency domain by pitch shift and frequency stretch, piecewise time and frequency stretch, and color jittering. The complete augmentation pipeline improved the identification performance, quantified as Mean Reciprocal Rank, by almost 10%. In the field of animal audio classification, Sprengel et al. (2016) used standard audio augmentation techniques for bird audio classification, such as time and pitch shift. In addition, they created more samples by summing two different samples belonging to the same class, motivated by the fact that the sound of two birds from the same class should still be correctly classified. Pandeya et al. (2018) demonstrated that audio signal augmentation by simple techniques such as random selection of time stretching, pitch shifting, dynamic range compression and insertion of noise improved accuracy, F1-score and area under the ROC curve on the domestic cat sound dataset described in Section 5 of this paper. In particular, the performance improvement increased when including more augmented clones (one to three) per single original audio file. Conversely, Oikarinen et al. (2019) showed that augmenting their spectrograms by translations, adding random noise and multiplying the input by a random value close to one did not significantly improve their classification of marmoset audio signals. Of note, the aim of Oikarinen et al. was not only the classification of species or call types, e.g. from publicly available datasets, but the identification of call types and of the source animal in a complex experimental setup consisting of multiple cages in one room, each cage containing two marmosets. Other techniques, inherited from e.g. speech recognition, are also suitable for animal sound classification. For instance, Jaitly and Hinton (2013) proposed Vocal Tract Length Perturbation (VTLP), which alters the vocal tract length during the extraction of a descriptor to create a new sample, and showed that this technique is very effective in speech recognition. Takahashi et al. (2016) used large convolutional networks with strong data augmentation to classify audio events. They also used VTLP and introduced a new transformation that consists of summing two different perturbed samples of the same class.
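To make the raw-signal augmentations surveyed above more concrete, the following minimal MATLAB sketch applies noise injection, time stretching by resampling and same-class sample mixing to a single recording. It is an illustrative example under stated assumptions, not the paper's exact protocols (those are listed in the Data augmentation section and partly rely on Audiogmenter); the file names are hypothetical.

```matlab
% Minimal sketch of raw-audio augmentations (illustrative, not the paper's
% exact protocols). File names are hypothetical placeholders.
[x, fs] = audioread('cat_sample.wav');
x = x(:, 1);                                   % keep one channel

% 1) Additive noise scaled to the signal amplitude
xNoise = x + 0.005 * max(abs(x)) * randn(size(x));

% 2) Time stretch by resampling (changes duration, and pitch accordingly)
xSlow = resample(x, 11, 10);                   % roughly 10% longer
xFast = resample(x, 10, 11);                   % roughly 10% shorter

% 3) Same-class mixing: sum two zero-padded samples of the same class
[y, ~] = audioread('cat_sample_same_class.wav');
y = y(:, 1);
n = max(numel(x), numel(y));
xMix = [x; zeros(n - numel(x), 1)] + [y; zeros(n - numel(y), 1)];
xMix = xMix / max(abs(xMix));                  % normalise to avoid clipping
```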

In this work, we compare different sets of data augmentation approaches, each coupled with different CNNs. In this way, an ensemble of networks is trained; finally, the set of classifiers is combined by sum rule. The proposed method is tested on two different audio classification datasets: the first related to domestic cat sound classification (Pandeya et al., 2018), the second to bird classification (Zhao et al., 2017). Our experiments were designed to compare and maximize the performance obtained by varying combinations of data augmentation approaches and classifiers, and they showed that our augmentation techniques were successful at improving the classification accuracy.
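The sum-rule fusion mentioned above reduces to adding the class-score matrices produced by the individual networks and taking the arg-max; a minimal MATLAB sketch follows, where the cell array scores and the vector trueLabels are hypothetical placeholders for the softmax outputs of the trained CNNs and the numeric ground-truth class indices.

```matlab
% Sum-rule fusion: scores{i} is the (nSamples x nClasses) softmax output of
% the i-th trained CNN on the test set; trueLabels holds numeric class indices.
fused = zeros(size(scores{1}));
for i = 1:numel(scores)
    fused = fused + scores{i};                 % sum rule
end
[~, predicted] = max(fused, [], 2);            % winning class per sample
recognitionRate = mean(predicted == trueLabels(:));   % accuracy of the ensemble
```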

Our main contributions to the community are the following:

  • Different methods for audio data augmentation are tested/proposed/compared on two datasets;

  • Exhaustive tests are performed on fusions among ensemble systems based on CNNs trained with different data augmentation approaches;

  • All MATLAB source code used in our experiments will be freely available at https://github.com/LorisNanni

Section snippets

Audio image representation

In order to get image representations of the audio signals we applied a Discrete Gabor Transform (DGT) to each signal. The DGT is a particular case of the Short-Time Fourier Transform in which the window function is a Gaussian kernel. The continuous Gabor transform is defined as the convolution between a Gaussian and the product of the signal with a complex exponential:

$$G(\tau,\omega)=\frac{1}{\sigma^{2}}\int_{-\infty}^{+\infty} x(t)\, e^{-i\omega t}\, e^{-\frac{\pi}{\sigma^{2}}(t-\tau)^{2}}\, dt$$

where x(t) is the signal, ω is a frequency and i is the imaginary unit. The parameter σ² is the width of the Gaussian window.
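Since the DGT is a Short-Time Fourier Transform with a Gaussian window, a Gabor-like spectrogram image can be sketched in MATLAB as follows; the window length, overlap and output size are illustrative assumptions, not the settings used in the paper.

```matlab
% Gaussian-window (Gabor-like) spectrogram image; the window length,
% overlap and output size are illustrative, not the paper's settings.
[x, fs] = audioread('bird_sample.wav');        % hypothetical input file
x = x(:, 1);
win = gausswin(512);                           % Gaussian analysis window
s = spectrogram(x, win, 384, 512, fs);         % complex time-frequency coefficients
img = mat2gray(log(abs(s) + eps));             % log-magnitude scaled to [0, 1]
img = imresize(img, [224 224]);                % resize to a typical CNN input
imwrite(img, 'bird_sample_spectrogram.png');   % save as an image for training
```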

Convolutional neural networks

In this work, we used CNNs both for feature extraction (to train SVMs) and for direct classification. CNNs, introduced in 1998 by LeCun et al. (1998), are deep feed-forward neural networks in which neurons are connected only locally to neurons of the previous layer. Weights, biases and activation functions are iteratively adjusted during the training phase. In addition to the input layer, i.e. the image (or image patch) to be classified, and the output/classification (CLASS) layer, composed of one neuron per class, a CNN contains several hidden layers.
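As an illustration of both uses of a CNN mentioned above (fine-tuning for direct classification and feature extraction to train an SVM), the following MATLAB sketch fine-tunes a pretrained AlexNet on spectrogram images and reuses one of its fully connected layers as a feature extractor. The folder layout, training options and choice of AlexNet are assumptions for the example, not the paper's exact configuration.

```matlab
% Fine-tune a pretrained AlexNet on spectrogram images, then reuse an
% intermediate layer as a feature extractor for an SVM. Folder names and
% training options are illustrative placeholders.
imds = imageDatastore('spectrograms', 'IncludeSubfolders', true, ...
                      'LabelSource', 'foldernames');
[trainSet, testSet] = splitEachLabel(imds, 0.9, 'randomized');  % testSet for evaluation (not shown)

net = alexnet;                                  % requires its support package
nClasses = numel(categories(trainSet.Labels));
layers = [net.Layers(1:end-3)                   % drop the original fc8/softmax/output layers
          fullyConnectedLayer(nClasses)
          softmaxLayer
          classificationLayer];

opts = trainingOptions('sgdm', 'MaxEpochs', 20, 'MiniBatchSize', 32, ...
                       'InitialLearnRate', 1e-4);
augTrain = augmentedImageDatastore([227 227 3], trainSet);   % AlexNet input size
tunedNet = trainNetwork(augTrain, layers, opts);

% Deep features ('fc7' activations) used to train a multiclass SVM
deepFeat = activations(tunedNet, augTrain, 'fc7', 'OutputAs', 'rows');
svmModel = fitcecoc(deepFeat, trainSet.Labels);
```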

Data augmentation approaches

In this paper, we tested the following four augmentation protocols. For the third and fourth protocols we used the methods provided by Audiogmenter (Maguolo et al., 2019), an audio data augmentation library for MATLAB.
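Since the exact protocol definitions are not reproduced in this snippet, the following is only a generic MATLAB sketch of spectrogram-domain augmentation of the kind discussed in the Introduction (random time translation, additive noise and mild amplitude scaling); it does not use Audiogmenter's API, and the variable img is assumed to be a spectrogram image in [0, 1] such as the one produced in the earlier sketch.

```matlab
% Generic spectrogram-domain augmentation (not Audiogmenter's API).
% img is assumed to be a spectrogram image with values in [0, 1].
shiftT = randi([-20 20]);                      % random shift along the time axis
aug = circshift(img, [0 shiftT]);              % circular time translation
aug = aug + 0.01 * randn(size(aug));           % small additive noise
aug = aug * (0.9 + 0.2 * rand);                % multiply by a value close to one
aug = min(max(aug, 0), 1);                     % clip back to [0, 1]
```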

Experimental results

We assessed the effects of data augmentation using a stratified ten-fold cross validation protocol and the recognition rate as the performance indicator (i.e. the average accuracy over the different folds); a minimal sketch of this protocol is given after the dataset list below. We tested our approach on the following two datasets of animal audio recordings:

  • BIRDZ, the control and real-world audio dataset used in (Zhao et al., 2017). The real-world recordings were downloaded from the Xeno-canto Archive (http://www.xeno-canto.org/), selecting a set of 11 widespread bird species.
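The sketch below illustrates the stratified ten-fold protocol and the recognition rate computation; feats, labels and trainAndPredict are hypothetical placeholders for the actual features, ground truth and classifier of each experiment.

```matlab
% Stratified ten-fold cross validation with the recognition rate (average
% accuracy over folds). feats, labels and trainAndPredict are hypothetical
% placeholders for the actual features, ground truth and classifier.
cv = cvpartition(labels, 'KFold', 10);         % stratified by class label
acc = zeros(cv.NumTestSets, 1);
for f = 1:cv.NumTestSets
    trIdx = training(cv, f);
    teIdx = test(cv, f);
    pred = trainAndPredict(feats(trIdx, :), labels(trIdx), feats(teIdx, :));
    acc(f) = mean(pred == labels(teIdx));      % accuracy on this fold
end
recognitionRate = mean(acc);                   % average over the ten folds
```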

Conclusion

In this paper we explored how different data augmentation techniques improve the accuracy of automated audio classification of natural sounds (bird and cat sounds) by means of deep networks.

Different types of data augmentation approaches for audio signals were proposed, tested and compared. Because of the nature of these signals, data augmentation methods were applied both to the raw audio signals and to their visual representations as spectrograms. A set of CNNs was trained using different data augmentation approaches, and their fusion outperformed the stand-alone classifiers.

Acknowledgment

The authors thank NVIDIA Corporation for supporting this work by donating a Titan Xp GPU and the Tampere Center for Scientific Computing for generous computational resources.

References

  • Y.M.G. Costa et al. Music genre recognition using spectrograms.

  • Y. Costa et al. Music genre recognition using Gabor filters and LPQ texture descriptors.

  • D.R. Edgington et al. Detecting, tracking and classifying animals in underwater video.

  • K.M. Fristrup et al. Marine Animal Sound Classification (1993).

  • P. Hansen et al. Neural network ensembles. IEEE Trans. Pattern Anal. Mach. Intell. (1990).

  • R.M. Haralick. Statistical and structural approaches to texture. Proc. IEEE (1979).

  • E.J. Humphrey et al. Rethinking automatic chord recognition with convolutional neural networks.

  • E.J. Humphrey et al. Moving beyond feature design: deep architectures and automatic feature learning in music informatics. 13th Int. Soc. Music Inf. Retr. Conf. (ISMIR) (2012).

  • N. Jaitly et al. Vocal tract length perturbation (VTLP) improves speech recognition.

  • A. Krizhevsky et al. ImageNet classification with deep convolutional neural networks. Commun. ACM (2012).

  • M. Lasseck. Audio-based bird species identification with deep convolutional neural networks.

  • Y. LeCun et al. Gradient-based learning applied to document recognition. Proc. IEEE (1998).

  • T. Lidy et al. Evaluation of feature extractors and psycho-acoustic transformations for music genre classification.

  • G. Maguolo et al. Audiogmenter: a MATLAB toolbox for audio data augmentation.

  • G. Marcus. Deep learning: a critical appraisal.

  • T. Nakashika et al. Local-feature-map integration using convolutional neural networks for music genre classification.

  • L. Nanni et al. Set of Texture Descriptors for Music Genre Classification (2014).