Elsevier

Pattern Recognition Letters

Volume 135, July 2020, Pages 299-306

Phoneme classification in reconstructed phase space with convolutional neural networks

https://doi.org/10.1016/j.patrec.2020.05.002

Highlights

  • We embed speech signals in a geometric phase space for analysis.

  • We propose the use of convolutional neural networks for quantifying the phase space.

  • We perform density normalization in 2D phase space to account for trajectory overlaps.

  • The RPS-CNN method performs better than prior techniques on phoneme classification tasks.

  • The method is generic for use in applications other than speech.

Abstract

In this paper, we analyse segmented speech phonemes with convolutional filters after embedding them in Reconstructed Phase Space (RPS). These feature-extracting convolutional filters are trained on the embedded speech data from scratch and are also fine-tuned from networks trained on other data. Reconstruction of phase space portrays the dynamics of an observed system as a geometric representation. We present a study highlighting the discriminative capacity of the features that a Convolutional Neural Network (CNN) extracts from the textural pattern and shape of this geometric representation. CNNs are heavily used in image-related tasks but have not been applied to phase space portraits, possibly because of the higher dimensionality of the embedding. However, we find that applying a CNN to a restricted bi-dimensional RPS characterizes the space better than prior methods on high-dimensional embeddings. We show experimental results supporting the use of RPS with CNN (RPS-CNN) for phoneme classification. The results affirm that essential signal characteristics are automatically quantified from the phase portraits of speech and can be used in place of conventional techniques involving frequency-domain transformations.

Introduction

Phase space representations provide a geometric perspective on the underlying dynamics of a system. These representations are used extensively in the behavioral study of dynamical systems that are defined by a set of Ordinary Differential Equations (ODEs). Reconstruction of phase space, however, is required when modeling is restricted: when a system cannot be defined in terms of ODEs, or when the complex underlying mechanism cannot be measured directly. This is the case in practice for most dynamical systems, where only an external observation of the system's complex interactions is available. Speech is observed as the output of a complex, non-linear dynamical system. Phase space representations, also known as phase portraits, are likely to take the shape of a node, spiral, center or saddle (see Fig. 1), but chaotic dynamical systems have unusual shapes. The phase portraits of these chaotic systems are also referred to as strange attractors, owing to the unpredictability of the path of the trajectory. As a time-domain method, phase portraits are an important tool in non-linear approaches to speech processing. Speech is also known to contain elements of chaos [4]. The vocal tract system operates in different configurations for different sounds, and thus distinct phase portraits can be observed for the different sound units. Phase portraits can be used as symbols for comparison, characterising the underlying system configuration [3]. However, unlike a spectrogram, a phase portrait does not depict qualitative information about the speech signal such as fundamental frequency or pitch. Characterisations of phase space were attempted earlier for speech with measures such as Fractal Dimensions (FD), Lyapunov Exponents (LE) and Entropy (K) [8], [13]. Similar measures are also used in other works utilizing RPS, for example in the identification of arrhythmia from ECG signals [15].
While many of these measures aim to quantify some property of the phase portraits, such as recurrence or irregularity in the path of the trajectory, they do not extract all the information relevant to a specific application. We therefore use the filter kernels of a Convolutional Neural Network (CNN) to extract the information suited to a classification task. Further, we preprocess the phase space representation to reduce the effects of the restricted bi-dimensional embedding.
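
To illustrate how a bi-dimensional phase portrait might be turned into an image-like CNN input, the sketch below bins the delayed pairs (s(i), s(i+τ)) into a square grid and compresses the counts so that heavily overlapped regions of the trajectory do not dominate. The log-based scaling, the grid size, the toy sine input and the function name `rps_density_image` are assumptions chosen for illustration; the preprocessing actually used in the paper is not specified in this excerpt.

```python
import math


def rps_density_image(signal, tau, bins):
    """Bin the 2-D delay embedding (s[i], s[i + tau]) into a bins x bins grid.

    Counts are log-compressed and scaled to [0, 1]; this particular
    normalization is an assumption, not the paper's stated procedure.
    """
    if tau < 1 or bins < 1 or len(signal) <= tau:
        raise ValueError("need tau >= 1, bins >= 1 and len(signal) > tau")
    lo, hi = min(signal), max(signal)
    span = (hi - lo) or 1.0  # avoid division by zero for a constant signal

    def cell(value):
        # Map a sample to a bin index, clamping the maximum into the last bin.
        return min(int((value - lo) / span * bins), bins - 1)

    # Count how many trajectory points fall into each grid cell.
    counts = [[0] * bins for _ in range(bins)]
    for x, y in zip(signal, signal[tau:]):
        counts[cell(y)][cell(x)] += 1

    # Log compression damps cells that the trajectory visits many times.
    peak = max(max(row) for row in counts)
    scale = math.log1p(peak)
    return [[math.log1p(c) / scale for c in row] for row in counts]


# Hypothetical input: two periods of a sampled sine wave.
wave = [math.sin(2 * math.pi * i / 50) for i in range(100)]
image = rps_density_image(wave, tau=12, bins=16)
```

The resulting grid can be treated like a single-channel image; other choices, such as dividing by the total count or clipping, would fit the same outline.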

Section snippets

Reconstruction of phase space

The phase space of a dynamical system is reconstructed from a single dimensional observation following a procedure described below, known as the delay-coordinate embedding or the time-delay embedding. Let s(n) be a time-series signal sampled at discrete intervals, then the delay co-ordinate embedding s_e(n), with a delay τ and dimension m as parameters, is given by

$$
s_e(n) = \begin{bmatrix}
s(0) & s(1) & \cdots & s(n - m\tau) \\
s(0+\tau) & s(1+\tau) & \cdots & s(n - (m-1)\tau) \\
\vdots & \vdots & \ddots & \vdots \\
s(0+m\tau) & s(1+m\tau) & \cdots & s(n)
\end{bmatrix}
$$
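
As a minimal sketch of the delay-coordinate embedding described above, the following Python function builds the rows of the embedding matrix from a one-dimensional signal. The function name `delay_embed`, its argument names and the toy signal are illustrative assumptions, not the authors' code. The sketch follows the indexing of the matrix as printed (delays 0, τ, …, mτ), so it returns m + 1 delayed copies of the signal.

```python
def delay_embed(signal, tau, m):
    """Return the delay-coordinate embedding of a 1-D signal as a list of rows.

    Row k holds the samples s(i + k*tau) for k = 0..m, mirroring the
    matrix as printed above (an assumption about the intended indexing).
    """
    if tau < 1 or m < 1:
        raise ValueError("tau and m must be positive integers")
    width = len(signal) - m * tau  # number of columns (embedded points)
    if width < 1:
        raise ValueError("signal is too short for this tau and m")
    return [list(signal[k * tau : k * tau + width]) for k in range(m + 1)]


# Hypothetical toy signal, for illustration only.
s = [0, 1, 2, 3, 4, 5, 6, 7]
rows = delay_embed(s, tau=2, m=2)

# Each column of `rows` is one point of the reconstructed trajectory.
points = list(zip(*rows))
```

Here the last entry of the first row is s(n − mτ) and the last entry of the final row is s(n), with n the index of the last sample, which matches the corner entries of the matrix.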

The state of the dynamical system at any instant t can be

Problem specification

The reconstructed phase portrait is a geometric structure in m dimensions. Measures such as Fractal Dimensions and Maximum Lyapunov Exponents (MLE) quantify the phase space, characterising the dynamical system based on some properties visible in the trajectory. However, they provide only a specialized qualitative description of the RPS and are not comprehensive. Thus we have the problem of numerically quantifying the distribution of points in an m-dimensional geometric space into a feature vector that

Methods and data

Convolutional Neural Networks provide state-of-the-art performance on visual and geometric data in many tasks. Here, feature detectors are learnt automatically from training data, without requiring any definition of shape or texture descriptors. Sequentially stacked convolutional layers form a hierarchical combination of detectors, from simple ones in the lower layers to complex ones in the higher layers, effectively analysing texture and other patterns. However, the input requirements of the CNN restrict

Experiments

In speech processing and subsequent recognition, contextual knowledge including language information is used for smoothing the prediction of the identification system [23]. The accuracy of a full speech recognition system therefore does not help in analysing front-end signal processing techniques effectively. Like similar experiments on feature analysis, we evaluate our methods on an isolated phoneme classification task without any use of contextual information. Accurate classification of

Discussion

The performance of the RPS-CNN method is compared with other RPS and non-RPS methods in the experiments described above. Apart from the classification accuracies, the internal workings of the CNN can be analysed by extracting the features from a hidden layer of the network. The weights are extracted from a fully connected layer of the CNN, embedded in two dimensions with the t-SNE algorithm, and visualised on a flat non-overlapping surface, as shown in Fig. 11. The clear separation

Conclusion

In this paper, we characterize and classify speech phonemes with a CNN after reconstructing them in phase space by delay-coordinate embedding. The speech phonemes are processed visually from the geometric space of the embedding. Reconstructed Phase Space processing of speech signals exploits production-related non-linearities. We notice better discrimination among liquids and fricatives in comparison with other phoneme categories. From experimental observations, the RPS-CNN method performed

Declaration of Competing Interest

The authors whose names are listed immediately below certify that they have NO affiliations with or involvement in any organization or entity with any financial interest (such as honoraria; educational grants; participation in speakers bureaus; membership, employment, consultancies, stock ownership, or other equity interest; and expert testimony or patent-licensing arrangements), or non-financial interest (such as personal or professional relationships, affiliations, knowledge or beliefs) in

References (23)

  • M.T. Johnson et al., Time-domain isolated phoneme classification using reconstructed phase spaces, IEEE Trans. Speech Audio Process. (2005)

    Editor: Emmanouil Benetos
