Phoneme classification in reconstructed phase space with convolutional neural networks☆
Introduction
Phase space representations provide a geometric perspective on the underlying dynamics of a system. These representations are used extensively in the behavioral study of dynamical systems that are defined by a set of Ordinary Differential Equations (ODEs). Reconstruction of the phase space is required, however, when modeling is restricted, that is, when a system cannot be defined in terms of ODEs or when the complex underlying mechanism cannot be measured directly. This is the case in practice for most dynamical systems, where only an external observation of the system's complex interactions is available. Speech is observed as the output of such a complex, non-linear dynamical system. Phase space representations, also known as phase portraits, are likely to take the shape of a node, spiral, center or saddle (see Fig. 1), but chaotic dynamical systems produce unusual shapes. The phase portraits of chaotic dynamical systems are also referred to as strange attractors, owing to the unpredictability of the path of the trajectory.
As a time-domain method, phase portraits are an important tool in non-linear approaches to speech processing. Speech is also known to contain elements of chaos [4]. The vocal tract system operates in different configurations for different sounds, and thus distinct phase portraits can be observed for the different sound units. Phase portraits can be used as symbols for comparison, characterising the underlying system configuration [3]. However, a phase portrait does not depict qualitative information about the speech signal, such as fundamental frequency or pitch, as a spectrogram does. Such characterisations of phase space were attempted earlier for speech with measures such as Fractal Dimensions (FD), Lyapunov Exponents (LE) and Entropy (K) [8], [13]. Similar measures are also used in other works utilizing the Reconstructed Phase Space (RPS), for example, in the identification of arrhythmia from ECG signals [15].
While many of these measures aim to quantify some property of the phase portrait, such as recurrence or irregularity in the path of the trajectory, they do not extract all the information relevant to a specific application. We therefore use the filter kernels of a Convolutional Neural Network (CNN) to extract the information suited to a classification task. Further, we preprocess the phase space representation to reduce the effects of the restricted bi-dimensional embedding.
Section snippets
Reconstruction of phase space
The phase space of a dynamical system is reconstructed from a single-dimensional observation by the procedure described below, known as delay-coordinate embedding or time-delay embedding. Let s(n) be a time-series signal sampled at discrete intervals; then the delay-coordinate embedding s_e(n), with delay τ and dimension m as parameters, is given by,
The state of the dynamical system at any instant t can be
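The embedding described above can be sketched in a few lines of Python. The equation itself is not reproduced in this excerpt, so the sketch assumes one common convention, s_e(n) = [s(n), s(n + τ), ..., s(n + (m − 1)τ)]; the mirrored form with negative delays is equally common. The function name `delay_embed` is hypothetical and is used only for illustration.

```python
import math


def delay_embed(signal, tau, m):
    """Return the delay-coordinate embedding of a 1-D sequence.

    Row n is (s(n), s(n + tau), ..., s(n + (m - 1) * tau)).
    The forward (plus-tau) convention is an assumption made here.
    """
    if tau < 1 or m < 1:
        raise ValueError("tau and m must be positive integers")
    n_points = len(signal) - (m - 1) * tau
    if n_points <= 0:
        raise ValueError("signal is too short for this tau and m")
    return [
        tuple(signal[n + k * tau] for k in range(m))
        for n in range(n_points)
    ]


if __name__ == "__main__":
    # Toy signal: a sinusoid with a period of 40 samples.
    s = [math.sin(2 * math.pi * n / 40) for n in range(200)]
    # A delay of a quarter period turns the 2-D portrait into a circle.
    portrait = delay_embed(s, tau=10, m=2)
    print(len(portrait), portrait[0])
```

For the toy sinusoid, a delay of a quarter period pairs each sample sin(θ) with cos(θ), so every embedded point lies on the unit circle, a closed orbit of the "center" type.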
Problem specification
The reconstructed phase portrait is a geometric structure in m dimensions. Measures such as Fractal Dimensions and Maximum Lyapunov Exponents (MLE) quantify the phase space, characterising the dynamical system based on some properties visible in the trajectory. However, they provide only a specialized qualitative description of the RPS and are not comprehensive. Thus we have the problem of numerically quantifying the distribution of points in an m-dimensional geometric space as a feature vector that
Methods and data
Convolutional Neural Networks provide state-of-the-art performance on visual and geometric data in many tasks. Here, feature detectors are learnt automatically from training data, without requiring any definition of shape or texture descriptors. Sequentially stacked convolutional layers form a hierarchical combination of detectors, from simple ones in the lower layers to complex ones in the higher layers, effectively analysing texture and other patterns. However, the input requirements of the CNN restrict
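As an illustration of how a two-dimensional phase portrait might be presented to a CNN, the following Python sketch rasterises trajectory points into a fixed-size occupancy grid. This is an assumption made for illustration and not necessarily the preprocessing used in the paper: the function name `rasterize_portrait`, the grid size, and the scaling by the peak count are all hypothetical choices.

```python
def rasterize_portrait(points, size=32):
    """Count 2-D trajectory points per cell of a size x size grid.

    Returns a list of rows with values scaled to [0, 1]. Only the first
    two coordinates of each point are used. Row 0 holds the smallest
    y values, so the image is vertically flipped relative to a plot.
    """
    if not points:
        raise ValueError("points must be non-empty")
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    # One shared range for both axes keeps the portrait's aspect ratio.
    lo = min(min(xs), min(ys))
    hi = max(max(xs), max(ys))
    span = (hi - lo) or 1.0  # guard against a constant signal

    grid = [[0] * size for _ in range(size)]
    for x, y in zip(xs, ys):
        col = min(int((x - lo) / span * size), size - 1)
        row = min(int((y - lo) / span * size), size - 1)
        grid[row][col] += 1

    peak = max(max(r) for r in grid)
    return [[c / peak for c in r] for r in grid]
```

Each cell then records how often the trajectory visits that region, so phonemes of different lengths map to images of the same shape. A finer grid keeps more detail of the trajectory at the cost of sparser counts.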
Experiments
In speech processing and subsequent recognition, contextual knowledge, including language information, is used for smoothing the predictions of the identification system [23]. The accuracy of the full speech recognition system therefore does not help in analysing front-end signal processing techniques effectively. Like similar experiments on feature analysis, we evaluate our methods on an isolated phoneme classification task without any use of contextual information. Accurate classification of
Discussion
The performance of the RPS-CNN method is compared with other RPS and non-RPS methods in the experiments described above. Apart from the classification accuracies, the internal workings of the CNN can be analysed by extracting the features from a hidden layer of the network. The weights are extracted from a fully connected layer of the CNN, embedded in two dimensions with the t-SNE algorithm, and visualised on a flat, non-overlapping surface, as shown in Fig. 11. The clear separation
Conclusion
In this paper, we characterize and classify speech phonemes with a CNN after reconstructing them in phase space by delay-coordinate embedding. The speech phonemes are processed visually from the geometric space of the embedding. Reconstructed Phase Space processing of speech signals exploits production-related non-linearities. We notice better discrimination among liquids and fricatives in comparison with other phoneme categories. From experimental observations, the RPS-CNN method performed
Declaration of Competing Interest
The authors whose names are listed immediately below certify that they have NO affiliations with or involvement in any organization or entity with any financial interest (such as honoraria; educational grants; participation in speakers bureaus; membership, employment, consultancies, stock ownership, or other equity interest; and expert testimony or patent-licensing arrangements), or non-financial interest (such as personal or professional relationships, affiliations, knowledge or beliefs) in
References (23)
- et al. Analysis and classification of speech signals by generalized fractal dimension features. Speech Commun. (2009)
- et al. Experiments in speech recognition using a modular MLP architecture for acoustic modelling. Inf. Sci. (2003)
- et al. SR-NBS: a fast sparse representation based N-best class selector for robust phoneme classification. Eng. Appl. Artif. Intell. (2014)
- et al. Phoneme classification using the auditory neurogram. IEEE Access (2017)
- Lopes, C. and Perdigao, F. Phoneme recognition on the TIMIT database. Speech Technol. (2011)
- Attractor comparisons based on density. Chaos (2015)
- et al. Some notes on nonlinearities of speech. Nonlinear Speech Modeling and Applications (2005)
- et al. Independent coordinates for strange attractors from mutual information. Phys. Rev. A (1986)
- TIMIT acoustic phonetic continuous speech corpus. Linguist. Data Consort. (1993)
- et al. Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
- Time-domain isolated phoneme classification using reconstructed phase spaces. IEEE Trans. Speech Audio Process.
☆ Editor: Emmanouil Benetos