Automatic analysis of pronunciations for children with speech sound disorders
Introduction
Phonological disorders are among the most prevalent communicative disabilities diagnosed in preschool and school-age children, affecting roughly 10% of this population (Gierut, 1998). The American Speech-Language-Hearing Association has reported an observed relationship between early phonological disorders and subsequent reading, writing, spelling, and mathematical abilities (Castrogiovanni, 2008). Furthermore, speech production difficulties affect not only children’s communication and academic performance, but also their level of interaction with peers and adults. Given the limited availability of speech-language pathologists (SLPs) (Squires, 2013), a parent whose child has been diagnosed with a phonological disorder would likely prefer that the child practice and acquire language skills as quickly as possible, rather than relying solely on the limited time available with an SLP.
Technology offers one plausible way to address this need. Building on advances in speech recognition in the early 1990s (Huang et al., 1990), researchers have sought to compensate for the shortage of trained professionals by developing computer-assisted pronunciation training (CAPT) systems (Witt and Young, 1997) that target children’s pronunciation disabilities. These systems serve both as assistive tools for diagnosis and as aids for practicing correct pronunciation (Witt and Young, 1997; Neri et al., 2002; Neri et al., 2008; Bunnell et al., 2000; Russell et al., 1996; Eskenazi and Hansma, 1998; Kawai and Hirose, 1997).
Unfortunately, many current technological tools remain limited. Despite progress in the field of speech processing, existing solutions have primarily concentrated on applying conventional automatic speech recognition (ASR) approaches to assess pronunciation (Witt and Young, 1997; Neri et al., 2002; Ehsani and Knodt, 1998). Conventional ASR approaches, however, face problems in this area: children’s speech is characterized by increased acoustic variability, while conventional ASR systems are trained on acoustic models derived from adult speech (Koenig, 2001). Moreover, some mispronunciation patterns occur more frequently in children than in adults (Benzeghiba et al., 2007). As a result, although automatic pronunciation systems hold promise, they are still not widely used, since their performance is perceived as less reliable (Ploog et al., 2013).
Other solutions have focused on explicitly modeling possible mispronunciations (Doremalen et al., 2013; Ronanki et al., 2012; Wei et al., 2009; Strik et al., 2009); however, the state-of-the-art system using this approach has the limitation that every target pronunciation must be learned separately against its specific “competing” pronunciations, i.e. mispronunciations (Doremalen et al., 2013). Developing such a system therefore requires substantial human supervision and input. In addition, the authors note that, compared to previous approaches, the system underperforms on low-frequency pronunciation events, since it must learn the specific features of every mispronunciation. The lack of a generalizable approach is a hurdle for automation, as is the need for a large corpus of training examples. An ideal system would incorporate knowledge of mispronunciations from a relatively small corpus while requiring minimal human supervision and input.
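The per-target supervision burden described above can be illustrated with a minimal sketch. Here each target pronunciation is paired with an explicitly enumerated set of competing mispronunciations; the word, phoneme sequences, and error patterns below are hypothetical illustrations, not drawn from the cited systems.

```python
# Minimal sketch of explicit mispronunciation modeling: every target
# pronunciation is listed with its own set of competing error variants,
# which is why such systems need per-target supervision and data.
# All lexicon entries below are hypothetical illustrations.
LEXICON = {
    "rabbit": {
        "correct": ("r", "ae", "b", "ih", "t"),
        "errors": [
            ("w", "ae", "b", "ih", "t"),   # gliding: /r/ -> /w/
            ("r", "ae", "b", "ih"),        # final-consonant deletion
        ],
    },
}

def classify(word, recognized):
    """Return 'correct', 'known_error', or 'unknown' for a phone sequence."""
    entry = LEXICON[word]
    if recognized == entry["correct"]:
        return "correct"
    if recognized in entry["errors"]:
        return "known_error"
    return "unknown"   # low-frequency events fall outside the learned set
```

A sequence such as `("l", "ae", "b", "ih", "t")` falls into the `"unknown"` category, illustrating why low-frequency pronunciation events are hard for systems that must enumerate every competing variant in advance.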
Our research goals are derived from these limitations. The short-term objective of our research is to develop a method that will constitute the core component of an effective pronunciation analysis system for children aged 4–12 with speech sound disorders, enabling them to receive accurate feedback on their speech production even in the absence of a clinician. The desired feedback addresses the question of whether a child pronounced a phoneme correctly. To be effective, the system is also designed to be highly automated. The long-term goal is to integrate such a system into remediation techniques, complementing current therapy strategies. In this work, we build upon existing methodologies and extend them. Our main contributions are (1) an explicit model of a-priori pronunciation errors for children in the target age range, and (2) explicit modeling of the acoustics of distorted phonemes. We begin by reviewing previous approaches to automated pronunciation assessment. Next, we introduce a database containing mispronunciations of children with speech sound disorders. We then describe our proposed approach and apply a variety of evaluation metrics to understand the strengths and weaknesses of the proposed methods. Finally, we present discussion, conclusions, and planned future work.
Section snippets
Review of Goodness of Pronunciation (GOP)
In this section, we explore in detail several different algorithmic approaches to computing the Goodness of Pronunciation (GOP) technique, followed by an examination of machine learning-based approaches that both contributed to the field and are pertinent to the proposed methods described in Section 4.1.
The GOP technique, originally defined by Witt and Young (1997), has been used for pronunciation assessment and has evolved throughout the years in order to improve the quality of the decision
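The core idea behind the GOP measure of Witt and Young (1997) is a duration-normalized log-likelihood ratio between the target phoneme and the best-scoring competing phoneme over the same frames. The sketch below illustrates that arithmetic only; the log-likelihood values are invented, and this is not the authors' implementation.

```python
def gop(ll_target, ll_best_phones, n_frames):
    """
    Goodness of Pronunciation in the spirit of Witt and Young (1997):
    the duration-normalized absolute log-ratio between the likelihood
    of the target phoneme and that of the best-scoring phoneme over
    the same frames. Inputs are total log-likelihoods over the segment
    (assumed to come from a forced alignment and a phone loop).
    """
    return abs(ll_target - ll_best_phones) / n_frames

# A well-pronounced phone: target likelihood close to the best competitor,
# yielding a small GOP score (accept).
good = gop(-120.0, -118.0, 40)
# A mispronounced phone: target likelihood far below the best competitor,
# yielding a large GOP score (reject).
bad = gop(-200.0, -118.0, 40)
```

In practice a phone is flagged as mispronounced when its GOP score exceeds a phone-specific threshold; much of the subsequent literature revises how the competing likelihood and the threshold are computed.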
Data collection
The data used in the experiment came predominantly from the Corpus of Children’s Pronunciation (CCP). To collect the data, we recruited 86 children aged 4–12. Co-occurrence of receptive and expressive language disorders is prevalent in children with speech production challenges, so these children were screened to ensure that they had the ability to complete the tasks required in the study. The diagnosis of a speech-sound disorder was conferred by a licensed,
GOP-CI
Similar to our previous work (Dudy et al., 2015), we introduce the following algorithmic approach, which aims to improve the baseline GOP measure described in Eq. (7) of the LGOP approach by (1) learning acoustic models from a large children’s speech database aged 3–15 and then adapting to the speech of children in the final target age range of 4–11, (2) incorporating an explicit model of correct and incorrect pronunciations of the corpus described in the previous section, and (3) explicit
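The decision logic suggested by the snippet above, comparing a forced "correct" path against a network that also admits the known incorrect variants, can be sketched as follows. This is a hypothetical simplification, not the paper's GOP-CI implementation; the threshold and score values are illustrative only.

```python
def gop_ci_decision(ll_num, ll_ci, n_frames, threshold=0.2):
    """
    Hypothetical sketch of a GOP-CI style decision: compare the forced
    correct-path score ("Num") against the best path through a network
    that also contains the known incorrect pronunciation variants ("CI").
    If admitting the error variants improves the per-frame score by more
    than a threshold, the phone is flagged as mispronounced. The default
    threshold here is illustrative, not a value from the paper.
    """
    per_frame_gain = (ll_ci - ll_num) / n_frames
    return "mispronounced" if per_frame_gain > threshold else "correct"
```

When the correct path already explains the audio well, the CI network offers little extra likelihood and the phone is accepted; a large gain means an error variant matched the acoustics better than the canonical phone.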
Raw results
In the first part of data analysis the raw results produced by the ASR are inspected. To visualize the recognition results by each of the elements of Eqs. (15)–(17) found in the GOP-CI and GOP-SVM approach, we extracted authentic examples from our data. In Fig. 9, there are four cases produced by our system. These cases show the segmentation of phonemes across the different elements; namely, the numerator (with the correct path) – “Num”, the correct-incorrect path “CI,” the Open Loop path “OL”,
Discussion and conclusions
In the current paper we proposed two automatic decision methods, GOP-CI and GOP-SVM, aimed at analyzing the speech of children who may have speech-sound disorders. To evaluate the proposed methods, the most recent state-of-the-art GOP variant, LGOP, served as the automatic reference method.
The first proposed method was the GOP-CI whose parameters were set using a grid search. The second proposed method was GOP-SVM which was trained using an SVM approach. Both methods learned
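The GOP-SVM idea of learning an accept/reject decision from GOP-style score features can be sketched with a minimal linear SVM trained by sub-gradient descent on the hinge loss. This stands in for the paper's method only in spirit: the features, labels, and hyperparameters below are synthetic, and the paper's actual feature set and SVM configuration are not reproduced here.

```python
import random

def train_linear_svm(X, y, epochs=200, lr=0.05, lam=0.01, seed=0):
    """
    Minimal linear SVM via stochastic sub-gradient descent on the
    regularized hinge loss. A stand-in sketch for learning a
    correct/mispronounced decision from GOP-derived score features.
    """
    rng = random.Random(seed)
    w, b = [0.0] * len(X[0]), 0.0
    idx = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(idx)
        for i in idx:
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            if margin < 1:   # inside margin: hinge-loss gradient step
                w = [wj + lr * (y[i] * xj - lam * wj)
                     for wj, xj in zip(w, X[i])]
                b += lr * y[i]
            else:            # correctly classified: regularization only
                w = [wj - lr * lam * wj for wj in w]
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Synthetic two-dimensional GOP-style features (values are invented):
# small scores indicate a correct pronunciation (+1), large scores a
# mispronunciation (-1).
X = [[0.1, 0.0], [0.2, 0.1], [1.5, 1.2], [1.8, 1.0]]
y = [1, 1, -1, -1]
w, b = train_linear_svm(X, y)
```

After training, `predict(w, b, features)` yields the accept/reject decision; in a real system the features would be the GOP-style scores extracted per phoneme and a kernel SVM could replace the linear one.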
Future work
To create more robust models, one of our future plans is to improve the current state of the art for annotation scores. This could be done by adding more human expert annotators and building a model that learns a pattern according to a voting system, reflecting several experts’ decisions. However, this should take into account the rate of disagreement among the annotators, since high levels of disagreement would produce an ineffective tool for
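The voting scheme sketched above, a majority vote over several annotators' judgments together with an agreement rate that could gate low-consensus items out of training, can be illustrated as follows (the label values and cutoff usage are hypothetical).

```python
from collections import Counter

def vote_with_agreement(labels):
    """
    Majority vote over several annotators' judgments of one phoneme,
    returned together with the agreement rate (fraction of annotators
    voting for the winning label). Items whose agreement falls below
    some cutoff could be excluded from model training.
    """
    counts = Counter(labels)
    label, n = counts.most_common(1)[0]
    return label, n / len(labels)

label, agreement = vote_with_agreement(["correct", "correct", "incorrect"])
# Two of three annotators agree: majority label "correct", agreement 2/3.
```

A cutoff on `agreement` (e.g. requiring more than two thirds consensus) would operationalize the concern about high disagreement rates among annotators.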
Acknowledgment
The presented research was supported by the National Institutes of Health (grant number R21DC012139).
References (42)
- Benzeghiba et al. Automatic speech recognition and speech variability: a review. Speech Commun. (2007)
- The decision tree classifier: design and potential. IEEE Trans. Geosci. Electron. (1977)
- Plosive/fricative distinction: the voiceless case. J. Acoust. Soc. Am. (1990)
- Allauzen et al. OpenFst: a general and efficient weighted finite-state transducer library. Proceedings of the Ninth International Conference on Implementation and Application of Automata (CIAA 2007) (2007)
- Bates et al. Fitting linear mixed-effects models using lme4. J. Stat. Softw. (2015)
- Bunnell et al. STAR: articulation training for young children. Proceedings of the 2000 INTERSPEECH Conference (2000)
- The likelihood ratio, Wald, and Lagrange multiplier tests: an expository note. Am. Stat. (1982)
- Castrogiovanni. Incidence and Prevalence of Communication Disorders and Hearing Loss in Children (2008)
- Cortes and Vapnik. Support vector machine. Mach. Learn. (1995)
- Doremalen et al. Automatic pronunciation error detection in non-native speech: the case of vowel errors in Dutch. J. Acoust. Soc. Am. (2013)
- Dudy et al. Pronunciation analysis for children with speech sound disorders. Proceedings of the IEEE Thirty-Seventh Annual International Conference of the Engineering in Medicine and Biology Society (EMBC) (2015)
- Ehsani and Knodt. Speech technology in computer-aided language learning: strengths and limitations of a new CALL paradigm. Lang. Learn. Technol. (1998)
- Eskenazi and Hansma. The FLUENCY pronunciation trainer. Proceedings of the 1998 STiLL Workshop (1998)
- Gierut. Treatment efficacy: functional phonological disorders in children. J. Speech Lang. Hear. Res. (1998)
- Goldman Fristoe Test of Articulation. American Guidance Service
- Phonological oppositions in children: a perceptual study. J. Acoust. Soc. Am.
- Huang et al. Hidden Markov Models for Speech Recognition (1990)
- Kawai and Hirose. A CALL system using speech recognition to train the pronunciation of Japanese long vowels, the mora nasal and mora obstruents. Proceedings of EUROSPEECH 1997, Rhodes, Greece (1997)
- Koenig. Distributional characteristics of VOT in children’s voiceless aspirated stops and interpretation of developmental trends. J. Speech Lang. Hear. Res. (2001)
- Applied Linear Statistical Models
- PLASER: pronunciation learning via automatic speech recognition. Proceedings of the HLT-NAACL Workshop on Building Educational Applications Using Natural Language Processing