Phoneme classification in reconstructed phase space with convolutional neural networks

doi:10.1016/j.patrec.2020.05.002

Pattern Recognition Letters

Volume 135, July 2020, Pages 299-306


Highlights

• We embed speech signals in a geometric phase space for analysis.
• We propose the use of convolutional neural networks for quantifying the phase space.
• We perform density normalization in 2D phase space to account for trajectory overlaps.
• RPS-CNN method performs better than prior techniques on phoneme classification tasks.
• The method is generic for use in applications other than speech.

Abstract

In this paper, we analyse segmented speech phonemes with convolutional filters after embedding them in Reconstructed Phase Space (RPS). These feature-extracting convolutional filters are trained on the embedded speech data from scratch and are also fine-tuned from networks trained with other data. Reconstruction of phase space portrays the dynamics of an observed system as a geometric representation. We present a study highlighting the discriminative capacity of the features that a Convolutional Neural Network (CNN) extracts from the textural pattern and shape of this geometric representation. CNNs are heavily used in image-related tasks, but have not seen application to phase space portraits, possibly due to the higher dimensionality of the embedding. However, we find that applying a CNN to a restricted bi-dimensional RPS characterizes the space better than prior methods on high-dimensional embeddings. We show experimental results supporting the use of RPS with CNN (RPS-CNN) for phoneme classification. The results affirm that essential signal characteristics are automatically quantified from the phase portraits of speech and can be used in place of conventional techniques involving frequency domain transformations.

Introduction

Phase space representations provide a geometric perspective on the underlying dynamics of a system. These representations are extensively used in the behavioral study of dynamical systems that are defined by a set of Ordinary Differential Equations (ODEs). Reconstructions of phase space, however, are required when modeling is restricted, that is, when a system cannot be defined in terms of ODEs or when the complex underlying mechanism cannot be directly measured. This is practically the case for most dynamical systems, where only the external observation of the complex interactions of the system is available. Speech is observed as the output of a complex, non-linear dynamical system. These phase space representations, also known as phase portraits, are likely to be one of node, spiral, center or saddle in shape (see Fig. 1), but dynamical systems that are chaotic have unusual shapes. The phase portraits of these chaotic dynamical systems are also referred to as strange attractors, due to the unpredictability of the path of the trajectory. As a time domain method, phase portraits are an important tool in non-linear approaches to speech processing. Speech is also known to contain elements of chaos [4]. The vocal tract system operates in different configurations for different sounds, and thus distinct phase portraits can be observed for the different sound units. Phase portraits can be used as symbols for comparison, characterising the underlying system configuration [3]. However, a phase portrait does not depict qualitative information about the speech signal, such as fundamental frequency or pitch, as a spectrogram does. Such characterisations of phase space were attempted earlier for speech with measures such as Fractal Dimensions (FD), Lyapunov Exponents (LE), Entropy (K), etc. [8], [13]. Similar measures are also used in other works utilizing RPS, for example, in identification of arrhythmia from ECG signals [15].
While many of these measures aim at quantifying some property of the phase portraits, such as recurrence or irregularity in the path of the trajectory, they do not extract all the information relevant for a specific application. We therefore use the filter kernels of a Convolutional Neural Network (CNN) to extract relevant information suited to a classification task. Further, we also preprocess the phase space representation to reduce the effects of the restricted bi-dimensional embedding.
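As a rough illustration of how a bi-dimensional phase portrait could be turned into an image-like input for a CNN, the sketch below rasterises the points (s(n), s(n + τ)) into a grid of visit counts and then rescales the counts so that heavily overlapped regions do not dominate. The function name `rps_density_image`, the grid size `bins`, and the log-compression followed by division by the maximum are all assumptions made for illustration; this excerpt does not specify the normalisation actually used.

```python
import math


def rps_density_image(signal, tau, bins=32):
    """Rasterise the 2-D delay portrait (s(n), s(n + tau)) into a bins x bins grid.

    Returns a list of rows whose values lie in [0, 1].
    """
    pairs = [(signal[k], signal[k + tau]) for k in range(len(signal) - tau)]
    if not pairs:
        raise ValueError("signal too short for this tau")

    lo, hi = min(signal), max(signal)
    span = (hi - lo) or 1.0  # avoid division by zero for a constant signal

    # Count how many trajectory points fall into each grid cell.
    counts = [[0] * bins for _ in range(bins)]
    for x, y in pairs:
        col = min(int((x - lo) / span * bins), bins - 1)
        row = min(int((y - lo) / span * bins), bins - 1)
        counts[row][col] += 1

    # Assumed normalisation (not taken from the paper): log-compress the
    # counts, then divide by the largest value so the busiest cell is 1.0.
    peak = math.log1p(max(max(r) for r in counts))
    return [[math.log1p(c) / peak for c in r] for r in counts]


if __name__ == "__main__":
    wave = [math.sin(0.1 * i) for i in range(500)]
    image = rps_density_image(wave, tau=5, bins=16)
    print(len(image), len(image[0]))
```

The resulting grid can be treated as a single-channel image; a different binning or normalisation could be swapped in without changing the rest of the pipeline.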

Section snippets

Reconstruction of phase space

The phase space of a dynamical system is reconstructed from a one-dimensional observation by the procedure described below, known as delay-coordinate embedding or time-delay embedding. Let s(n) be a time-series signal sampled at discrete intervals. The delay-coordinate embedding s_e(n), with delay τ and dimension m as parameters, is given by

$s_{e}(n) = \begin{bmatrix} s(0) & s(1) & \dots & s(n - m\tau) \\ s(0 + \tau) & s(1 + \tau) & \dots & s(n - (m - 1)\tau) \\ \vdots & \vdots & \ddots & \vdots \\ s(0 + m\tau) & s(1 + m\tau) & \dots & s(n) \end{bmatrix}$
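The embedding can be sketched in a few lines of plain Python. The function name `delay_embed` is hypothetical, and the sketch assumes the common convention of m coordinates per point with offsets 0, τ, …, (m − 1)τ; if the offsets are meant to run up to mτ, the range would need one more term. Each returned tuple corresponds to one column of the embedding matrix, that is, one point on the reconstructed trajectory.

```python
def delay_embed(signal, m, tau):
    """Return the list of m-dimensional delay vectors of `signal`.

    Vector k is (s(k), s(k + tau), ..., s(k + (m - 1) * tau)).
    """
    if m < 1 or tau < 1:
        raise ValueError("m and tau must be positive integers")
    n_vectors = len(signal) - (m - 1) * tau
    if n_vectors <= 0:
        raise ValueError("signal too short for this m and tau")
    return [
        tuple(signal[k + j * tau] for j in range(m))
        for k in range(n_vectors)
    ]


if __name__ == "__main__":
    s = [0, 1, 2, 3, 4, 5, 6, 7]
    for point in delay_embed(s, m=3, tau=2):
        print(point)
    # prints (0, 2, 4), (1, 3, 5), (2, 4, 6), (3, 5, 7), one per line
```

With m = 2 the points can be plotted directly as a planar phase portrait.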

The state of the dynamical system at any instant t can be

Problem specification

The reconstructed phase portrait is a geometric structure in m dimensions. Measures such as Fractal Dimensions and Maximum Lyapunov Exponents (MLE) quantify the phase space, characterising the dynamical system based on some properties visible in the trajectory. However, they provide only a specialized qualitative description of the RPS and are not comprehensive. Thus we have the problem of numerically quantifying the distribution of points in an m-dimensional geometric space as a feature vector that

Methods and data

Convolutional Neural Networks provide state-of-the-art performance on visual and geometric data in many tasks. Here, feature detectors are learnt automatically from training data, without requiring any definition of shape or texture descriptors. Sequentially stacked convolutional layers form a hierarchical combination of detectors, from simple ones in the lower layers to complex ones in the higher layers, effectively analysing texture and other patterns. However, the input requirements of the CNN restrict

Experiments

In speech processing and subsequent recognition, contextual knowledge including language information is used for smoothing the prediction of the identification system [23]. The accuracy of the full speech recognition system therefore does not help in analysing front-end signal processing techniques effectively. Like similar experiments on feature analysis, we evaluate our methods on an isolated phoneme classification task without any use of contextual information. Accurate classification of

Discussion

The performance of the RPS-CNN method is compared with other RPS and non-RPS methods in the experiments described above. Apart from the classification accuracies, the internal workings of the CNN can be analysed by extracting the features from a hidden layer of the network. The features are extracted from a fully connected layer of the CNN, embedded in two dimensions with the t-SNE algorithm, and visualised on a flat non-overlapping surface, as shown in Fig. 11. The clear separation

Conclusion

In this paper, we characterize and classify speech phonemes with a CNN after reconstructing them in phase space by delay-coordinate embedding. The speech phonemes are processed visually from the geometric space of the embedding. Reconstructed Phase Space processing of speech signals exploits production-related non-linearities. We notice better discrimination among liquids and fricatives in comparison with other phoneme categories. From experimental observations, RPS-CNN method performed

Declaration of Competing Interest

The authors whose names are listed immediately below certify that they have NO affiliations with or involvement in any organization or entity with any financial interest (such as honoraria; educational grants; participation in speakers bureaus; membership, employment, consultancies, stock ownership, or other equity interest; and expert testimony or patent-licensing arrangements), or non-financial interest (such as personal or professional relationships, affiliations, knowledge or beliefs) in

References (23)

V. Pitsikalis et al.
Analysis and classification of speech signals by generalized fractal dimension features
Speech Commun.
(2009)
T.J. Reynolds et al.
Experiments in speech recognition using a modular MLP architecture for acoustic modelling
Inf. Sci.
(2003)
A. Saeb et al.
SR-NBS: a fast sparse representation based N-best class selector for robust phoneme classification
Eng. Appl. Artif. Intell.
(2014)
M.S. Alam et al.
Phoneme classification using the auditory neurogram
IEEE Access
(2017)
C. Lopes, F. Perdigão
Phoneme recognition on the TIMIT database
Speech Technol.
(2011)
T. Carroll
Attractor comparisons based on density
Chaos
(2015)
A. Esposito et al.
Some notes on nonlinearities of speech
Nonlinear Speech Modeling and Applications
(2005)
A.M. Fraser et al.
Independent coordinates for strange attractors from mutual information
Phys. Rev. A
(1986)
J.S. Garofolo
TIMIT acoustic phonetic continuous speech corpus
Linguist. Data Consort.
(1993)
K. He et al.
Deep residual learning for image recognition
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
(2016)

M.T. Johnson et al.
Time-domain isolated phoneme classification using reconstructed phase spaces
IEEE Trans. Speech Audio Process.
(2005)

Editor: Emmanouil Benetos
