Data augmentation approaches for improving animal audio classification

https://doi.org/10.1016/j.ecoinf.2020.101084

Highlights

  • We apply convolutional neural networks for classifying animal audio.

  • We test many data augmentation techniques applied on raw audio and on spectrograms.

  • Our ensemble improves the state of the art on two animal audio datasets.

Abstract

In this paper we present ensembles of classifiers for automated animal audio classification, exploiting different data augmentation techniques for training Convolutional Neural Networks (CNNs). The specific animal audio classification problems are i) bird and ii) cat sounds, whose datasets are freely available. We train five different CNNs on the original datasets and on their versions augmented by four augmentation protocols, working on the raw audio signals or on their representations as spectrograms. We compare our best approaches with the state of the art, showing that we obtain the best recognition rate on the same datasets, without ad hoc parameter optimization. Our study shows that different CNNs can be trained for the purpose of animal audio classification and that their fusion works better than the stand-alone classifiers. To the best of our knowledge, this is the largest study on data augmentation for CNNs applied to animal audio datasets using the same set of classifiers and parameters. Our MATLAB code is available at https://github.com/LorisNanni.

Introduction

In the current context of constantly increasing environmental awareness, highly accurate sound recognition systems can play a pivotal role in mitigating or managing threats such as the increasing risk of animal species loss or climate change affecting wildlife (Zhao et al., 2017). Sound classification and recognition have been included among the pattern recognition tasks in different application domains, e.g. speech recognition (Padmanabhan and Johnson Premkumar, 2015), music classification (Nanni et al., 2017), environmental sound recognition and biometric identification (Sahoo et al., 2012). In the traditional pattern recognition framework (preprocessing, feature extraction and classification), features have generally been extracted from the actual audio traces (e.g. the Statistical Spectrum Descriptor or the Rhythm Histogram (Lidy and Rauber, 2005)). However, the conversion of audio traces into visual representations enabled the use of feature extraction techniques commonly used for image classification. The most common visual representation of an audio trace displays its spectrum of frequencies as it varies with time, e.g. spectrograms (Wyse, 2017), Mel-frequency Cepstral Coefficient spectrograms (Rubin et al., 2016) and other representations derived from these. A spectrogram can be described as a bidimensional graph with two geometric dimensions (time and frequency) plus a third dimension encoding the signal amplitude at a specific frequency and time step as pixel intensity (Nanni et al., 2014a). For example, Costa et al. (2011, 2012) applied many texture analysis and classification techniques to music genre classification. In (Costa et al., 2011) grey level co-occurrence matrices (GLCMs) (Haralick et al., 1979) were computed on spectrograms as features to train support vector machines (SVMs) on the Latin Music Database (LMD) (Silla Jr et al., 2008). Similarly, in Costa et al. (2012) they used one of the most famous texture descriptors, the local binary pattern (LBP) (Ojala et al., 2002), again to train SVMs on the LMD and ISMIR04 (Cano et al., 2006) datasets, improving the classification accuracy with respect to their previous work. In 2013 (Costa et al., 2013) they followed the same approach, but used local phase quantization (LPQ) and Gabor filters (Ojansivu and Heikkilä, 2008) for feature extraction. This marked an interesting parallel between the development of increasingly refined texture descriptors for image classification and their application to sound recognition. In 2017, Nanni et al. (2017) presented the fusion of state-of-the-art texture descriptors with acoustic features extracted from the audio traces on multiple datasets, demonstrating how such fusion greatly improves the accuracy of a system based only on acoustic or visual features. However, with the diffusion of deep learning and the availability of increasingly powerful Graphics Processing Units (GPUs) at accessible costs, i) the canonical pattern recognition framework changed and ii) attention became focused on visual representations of acoustic traces. The optimization of the feature extraction step had a key role in the canonical framework, especially through the development of handcrafted features that place patterns from the same class closer to each other in the feature space while maximizing their distance from other classes.
Since deep classifiers learn the best features for describing patterns during the training process, the aforementioned feature engineering has lost part of its importance and has been coupled with the direct use of visual representations of audio traces, letting the classifiers select the most informative features. Another reason for representing the patterns as images at the beginning of the pipeline is the intrinsic architecture of the most famous deep classifiers, such as convolutional neural networks (CNNs), which require images as their input. This motivated researchers using CNNs in audio classification to advance methods for converting audio signals into time-frequency images.
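As a concrete illustration of the hand-crafted pipeline described above (spectrogram, texture descriptor, SVM), the following minimal MATLAB sketch converts audio files into log-magnitude spectrogram images, extracts LBP features and trains a multiclass SVM. It assumes the Signal Processing, Image Processing, Computer Vision and Statistics toolboxes; the variables wavFiles and labels are hypothetical placeholders, not part of the paper's code.

```matlab
% Minimal sketch: spectrogram -> LBP texture features -> multiclass SVM.
% wavFiles (cell array of file names) and labels are hypothetical placeholders.
feats = [];
for k = 1:numel(wavFiles)
    [x, fs] = audioread(wavFiles{k});                   % raw audio signal
    s = spectrogram(x(:, 1), hann(512), 256, 512, fs);  % complex STFT coefficients
    img = mat2gray(log(abs(s) + eps));                  % log-magnitude as a grey image
    feats(k, :) = extractLBPFeatures(img);              % LBP texture descriptor
end
svmModel = fitcecoc(feats, labels);                     % one-vs-one SVM ensemble
```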

Among the first studies using deep learning on audio images, Humphrey and Bello (Humphrey et al., 2012; Humphrey and Bello, 2012) explored CNNs as an alternative approach to music classification problems, defining the state of the art in automatic chord detection and recognition. Nakashika et al. (2012) performed music genre classification on the GTZAN dataset (Tzanetakis and Cook, 2002) by converting spectrograms into GLCM maps to train CNNs. Costa et al. (2017) fused canonical approaches, e.g. LBP-trained SVMs, with CNNs, performing better than the state of the art on the LMD dataset.

In addition to approaches derived directly from image classification, a few studies focused on different aspects of the classification process in order to make it more specific to sound recognition. Sigtia and Dixon (2014) aimed to adjust CNN parameters and structures, and showed how the training time was reduced by replacing sigmoid units with Rectified Linear Units (ReLU) and stochastic gradient descent with Hessian-Free optimization. Wang et al. (2017) proposed a novel CNN called the sparse coding CNN for sound event recognition and retrieval, obtaining competitive and sometimes better results than most of the other approaches when evaluating the performance under noisy and clean conditions. Another hybrid approach by Oramas et al. (2017) combined different modalities (album cover images, reviews and audio tracks) for multi-label music genre classification, using deep learning methods appropriate for each modality and outperforming the unimodal approaches.

The clear improvement in classification performance introduced by deep classifiers led to the application of sound recognition to further tasks, such as biodiversity assessment or the monitoring of animal species at risk. For example, birds have been acknowledged as biological indicators for ecological research. Therefore, their observation and monitoring are increasingly important for biodiversity conservation, with the additional advantage that the acquisition of video and audio information is minimally invasive. To date, many datasets are available to develop classifiers that identify and monitor different species such as birds (Acevedo et al., 2009; Cullinan et al., 2015), whales (Fristrup and Watkins, 1993), frogs (Acevedo et al., 2009), bats (Cullinan et al., 2015) and cats (Pandeya et al., 2018). For instance, Cao et al. (2015) combined a CNN with handcrafted features to classify marine animals (Nanni et al., 2014b) (the Fish and MBARI benthic animal dataset (Edgington et al., 2006)). Salamon et al. (2017) investigated the fusion of deep learning (using a CNN) and shallow learning for the problem of bird species identification, based on 5428 bird flight calls from 43 species. In both these works, the fusion of CNNs with more canonical techniques outperformed the single approaches.

One of the main drawbacks of deep learning approaches is the need for a large amount of training data (Marcus, 2018), in this case audio signals and, consequently, their visual representations. When the amount of training data is limited, data augmentation is a powerful tool. Animal sound datasets are usually much smaller than necessary, since sample collection and labelling can be very expensive. Commonly, audio signals can be augmented in the time and/or frequency domains, either directly on the raw signals or after their conversion into spectrograms. In Lasseck (2018) different augmentation techniques were applied to the training set for the BirdCLEF 2018 initiative (www.imageclef.org/node/230), which included over 30,000 bird sound samples covering 1500 species. Bird audio signals were first augmented in the time domain, e.g. by extracting chunks from random positions in each file, applying jitter to the duration, adding audio chunks from random files as background noise together with atmospheric noise, and applying random cyclic shifts and time interval dropout. Every augmented audio chunk was then converted into a spectrogram and further augmented in the frequency domain by pitch shift and frequency stretch, piecewise time and frequency stretch, and color jittering. The complete augmentation pipeline improved the identification performance, quantified as Mean Reciprocal Rank, by almost 10%. In the field of animal audio classification, Sprengel et al. (2016) used standard audio augmentation techniques for bird audio classification, such as time and pitch shift. In addition, they created more samples by summing two different samples belonging to the same class, motivated by the fact that the sound of two birds from the same class should still be correctly classified. Pandeya et al. (2018) demonstrated that audio signal augmentation by simple techniques such as random selection of time stretching, pitch shifting, dynamic range compression and insertion of noise improved accuracy, F1-score and area under the ROC curve on the domestic cat sound dataset described in Section 5 of this paper. In particular, the performance improvement increased when including more augmented clones (one to three) per single original audio file. Conversely, Oikarinen et al. (2019) showed that augmenting their spectrograms by translations, adding random noise and multiplying the input by a random value close to one did not significantly improve their classification of marmoset audio signals. Of note, the aim of Oikarinen et al. was not only the classification of species or call types, e.g. from publicly available datasets, but the identification of call types and of the source animal in a complex experimental setup consisting of multiple cages in one room, each cage containing two marmosets. Other techniques, inherited from e.g. speech recognition, are also suitable for animal sound classification. For instance, Jaitly and Hinton (2013) proposed Vocal Tract Length Perturbation (VTLP), which alters the vocal tract length during the extraction of a descriptor to create a new sample, and showed that this technique is very effective in speech recognition. Takahashi et al. (2016) used large convolutional networks with strong data augmentation to classify audio events. They also used VTLP and introduced a new transformation that consists of summing two different perturbed samples of the same class.
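To make the raw-signal augmentations surveyed above more concrete, the following minimal MATLAB sketch applies noise injection, time stretching by resampling and same-class sample mixing to a single recording. It is an illustrative example under stated assumptions, not the paper's exact protocols (those are listed in the Data augmentation section and partly rely on Audiogmenter); the file names are hypothetical.

```matlab
% Minimal sketch of raw-audio augmentations (illustrative, not the paper's
% exact protocols). File names are hypothetical placeholders.
[x, fs] = audioread('cat_sample.wav');
x = x(:, 1);                                   % keep one channel

% 1) Additive noise scaled to the signal amplitude
xNoise = x + 0.005 * max(abs(x)) * randn(size(x));

% 2) Time stretch by resampling (changes duration, and pitch accordingly)
xSlow = resample(x, 11, 10);                   % roughly 10% longer
xFast = resample(x, 10, 11);                   % roughly 10% shorter

% 3) Same-class mixing: sum two zero-padded samples of the same class
[y, ~] = audioread('cat_sample_same_class.wav');
y = y(:, 1);
n = max(numel(x), numel(y));
xMix = [x; zeros(n - numel(x), 1)] + [y; zeros(n - numel(y), 1)];
xMix = xMix / max(abs(xMix));                  % normalise to avoid clipping
```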

In this work, we compare different sets of data augmentation approaches, each coupled with different CNNs. In this way, an ensemble of networks is trained; finally, the set of classifiers is combined by sum rule. The proposed method is tested on two different audio classification datasets: the first related to domestic cat sound classification (Pandeya et al., 2018), the second to bird classification (Zhao et al., 2017). Our experiments were designed to compare and maximize the performance obtained by varying combinations of data augmentation approaches and classifiers, and they showed that our augmentation techniques were successful at improving the classification accuracy.
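The sum-rule fusion mentioned above reduces to adding the class-score matrices produced by the individual networks and taking the arg-max; a minimal MATLAB sketch follows, where the cell array scores and the vector trueLabels are hypothetical placeholders for the softmax outputs of the trained CNNs and the numeric ground-truth class indices.

```matlab
% Sum-rule fusion: scores{i} is the (nSamples x nClasses) softmax output of
% the i-th trained CNN on the test set; trueLabels holds numeric class indices.
fused = zeros(size(scores{1}));
for i = 1:numel(scores)
    fused = fused + scores{i};                 % sum rule
end
[~, predicted] = max(fused, [], 2);            % winning class per sample
recognitionRate = mean(predicted == trueLabels(:));   % accuracy of the ensemble
```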

Our main contributions to the community are the following:

  • Different methods for audio data augmentation are tested/proposed/compared on two datasets;

  • Exhaustive tests are performed on fusions among ensemble systems based on CNNs trained with different data augmentation approaches;

  • All MATLAB source code used in our experiments will be freely available at https://github.com/LorisNanni

Section snippets

Audio image representation

In order to get image representations of the audio signals we applied a Discrete Gabor Transform (DGT) to each signal. The DGT is a particular case of the Short-Time Fourier Transform in which the window function is a Gaussian kernel. The continuous Gabor transform is defined as the convolution between a Gaussian and the product of the signal with a complex exponential:

$$G(\tau,\omega)=\frac{1}{\sigma^{2}}\int_{-\infty}^{+\infty} x(t)\, e^{-i\omega t}\, e^{-\frac{\pi}{\sigma^{2}}(t-\tau)^{2}}\, dt$$

where x(t) is the signal, ω is a frequency and i is the imaginary unit. The parameter σ² is the width of the Gaussian window.
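Since the DGT is a Short-Time Fourier Transform with a Gaussian window, a Gabor-like spectrogram image can be sketched in MATLAB as follows; the window length, overlap and output size are illustrative assumptions, not the settings used in the paper.

```matlab
% Gaussian-window (Gabor-like) spectrogram image; the window length,
% overlap and output size are illustrative, not the paper's settings.
[x, fs] = audioread('bird_sample.wav');        % hypothetical input file
x = x(:, 1);
win = gausswin(512);                           % Gaussian analysis window
s = spectrogram(x, win, 384, 512, fs);         % complex time-frequency coefficients
img = mat2gray(log(abs(s) + eps));             % log-magnitude scaled to [0, 1]
img = imresize(img, [224 224]);                % resize to a typical CNN input
imwrite(img, 'bird_sample_spectrogram.png');   % save as an image for training
```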

Convolutional neural networks

In this work, we used CNNs both for feature extraction (to train SVMs) and for direct classification. CNNs, introduced in 1998 by LeCun et al. (1998), are deep feed-forward neural networks in which neurons are connected only locally to neurons of the previous layer. Weights, biases and activation functions are iteratively adjusted during the training phase. In addition to the input layer, i.e. the image (or image patch) to be classified, and the output/classification (CLASS) layer, composed of one neuron per class, a CNN contains several hidden layers.
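As an illustration of both uses of a CNN mentioned above (fine-tuning for direct classification and feature extraction to train an SVM), the following MATLAB sketch fine-tunes a pretrained AlexNet on spectrogram images and reuses one of its fully connected layers as a feature extractor. The folder layout, training options and choice of AlexNet are assumptions for the example, not the paper's exact configuration.

```matlab
% Fine-tune a pretrained AlexNet on spectrogram images, then reuse an
% intermediate layer as a feature extractor for an SVM. Folder names and
% training options are illustrative placeholders.
imds = imageDatastore('spectrograms', 'IncludeSubfolders', true, ...
                      'LabelSource', 'foldernames');
[trainSet, testSet] = splitEachLabel(imds, 0.9, 'randomized');  % testSet for evaluation (not shown)

net = alexnet;                                  % requires its support package
nClasses = numel(categories(trainSet.Labels));
layers = [net.Layers(1:end-3)                   % drop the original fc8/softmax/output layers
          fullyConnectedLayer(nClasses)
          softmaxLayer
          classificationLayer];

opts = trainingOptions('sgdm', 'MaxEpochs', 20, 'MiniBatchSize', 32, ...
                       'InitialLearnRate', 1e-4);
augTrain = augmentedImageDatastore([227 227 3], trainSet);   % AlexNet input size
tunedNet = trainNetwork(augTrain, layers, opts);

% Deep features ('fc7' activations) used to train a multiclass SVM
deepFeat = activations(tunedNet, augTrain, 'fc7', 'OutputAs', 'rows');
svmModel = fitcecoc(deepFeat, trainSet.Labels);
```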

Data augmentation approaches

In this paper, we tested the following four augmentation protocols. For the third and fourth protocols we used the methods provided by Audiogmenter (Maguolo et al., 2019), an audio data augmentation library for MATLAB.
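Since the exact protocol definitions are not reproduced in this snippet, the following is only a generic MATLAB sketch of spectrogram-domain augmentation of the kind discussed in the Introduction (random time translation, additive noise and mild amplitude scaling); it does not use Audiogmenter's API, and the variable img is assumed to be a spectrogram image in [0, 1] such as the one produced in the earlier sketch.

```matlab
% Generic spectrogram-domain augmentation (not Audiogmenter's API).
% img is assumed to be a spectrogram image with values in [0, 1].
shiftT = randi([-20 20]);                      % random shift along the time axis
aug = circshift(img, [0 shiftT]);              % circular time translation
aug = aug + 0.01 * randn(size(aug));           % small additive noise
aug = aug * (0.9 + 0.2 * rand);                % multiply by a value close to one
aug = min(max(aug, 0), 1);                     % clip back to [0, 1]
```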

Experimental results

We assessed the effects of data augmentation using a stratified ten-fold cross validation protocol and the recognition rate as the performance indicator (i.e. the average accuracy over the different folds); a minimal sketch of this protocol is given after the dataset list below. We tested our approach on the following two datasets of animal audio recordings:

  • BIRDZ, the control and real-world audio dataset used in (Zhao et al., 2017). The real-world recordings were downloaded from the Xeno-canto Archive (http://www.xeno-canto.org/), selecting a set of 11 widespread bird species.
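The sketch below illustrates the stratified ten-fold protocol and the recognition rate computation; feats, labels and trainAndPredict are hypothetical placeholders for the actual features, ground truth and classifier of each experiment.

```matlab
% Stratified ten-fold cross validation with the recognition rate (average
% accuracy over folds). feats, labels and trainAndPredict are hypothetical
% placeholders for the actual features, ground truth and classifier.
cv = cvpartition(labels, 'KFold', 10);         % stratified by class label
acc = zeros(cv.NumTestSets, 1);
for f = 1:cv.NumTestSets
    trIdx = training(cv, f);
    teIdx = test(cv, f);
    pred = trainAndPredict(feats(trIdx, :), labels(trIdx), feats(teIdx, :));
    acc(f) = mean(pred == labels(teIdx));      % accuracy on this fold
end
recognitionRate = mean(acc);                   % average over the ten folds
```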

Conclusion

In this paper we explored how different data augmentation techniques improve the accuracy of automated audio classification of natural sounds (bird and cat sounds) by means of deep networks.

Different types of data augmentation approaches for audio signals were proposed, tested and compared. Because of the nature of these signals, data augmentation methods were applied both to the raw audio signals and to their visual representations as spectrograms. A set of CNNs was trained using different data augmentation approaches, and their fusion outperformed the stand-alone classifiers.

Acknowledgment

The authors thank NVIDIA Corporation for supporting this work by donating a Titan Xp GPU and the Tampere Center for Scientific Computing for generous computational resources.

References

  • Y.M.G. Costa et al. Music genre recognition using spectrograms.

  • Y. Costa et al. Music genre recognition using Gabor filters and LPQ texture descriptors.

  • D.R. Edgington et al. Detecting, tracking and classifying animals in underwater video.

  • K.M. Fristrup et al. Marine Animal Sound Classification (1993).

  • P. Hansen et al. Neural network ensembles. IEEE Trans. Pattern Anal. Mach. Intell. (1990).

  • R.M. Haralick. Statistical and structural approaches to texture. Proc. IEEE (1979).

  • E.J. Humphrey et al. Rethinking automatic chord recognition with convolutional neural networks.

  • E.J. Humphrey et al. Moving beyond feature design: deep architectures and automatic feature learning in music informatics. 13th Int. Soc. Music Inf. Retr. Conf. (ISMIR) (2012).

  • N. Jaitly et al. Vocal tract length perturbation (VTLP) improves speech recognition.

  • A. Krizhevsky et al. ImageNet classification with deep convolutional neural networks. Commun. ACM (2012).

  • M. Lasseck. Audio-based bird species identification with deep convolutional neural networks.

  • Y. LeCun et al. Gradient-based learning applied to document recognition. Proc. IEEE (1998).

  • T. Lidy et al. Evaluation of feature extractors and psycho-acoustic transformations for music genre classification.

  • G. Maguolo et al. Audiogmenter: a MATLAB toolbox for audio data augmentation.

  • G. Marcus. Deep learning: a critical appraisal.

  • T. Nakashika et al. Local-feature-map integration using convolutional neural networks for music genre classification.

  • L. Nanni et al. Set of Texture Descriptors for Music Genre Classification (2014).