Automatic analysis of pronunciations for children with speech sound disorders
Introduction
Phonological disorders are among the most prevalent communicative disabilities diagnosed in preschool and school-age children, affecting roughly 10% of this population (Gierut, 1998). The American Speech-Language-Hearing Association has reported an observed relationship between early phonological disorders and subsequent reading, writing, spelling, and mathematical abilities (Castrogiovanni, 2008). Furthermore, speech production difficulties affect not only children’s communication and academic performance, but also their level of interaction with peers and adults. Given the limited availability of speech-language pathologists (SLPs) (Squires, 2013), a parent whose child has been diagnosed with a phonological disorder would likely prefer that the child practice and acquire language skills as quickly as possible, rather than relying solely on the limited time available with an SLP.
Technology offers one plausible way to address this need. Building on advances in speech recognition in the early 1990s (Huang et al., 1990), researchers have sought to compensate for the shortage of trained professionals by developing computer-assisted pronunciation training (CAPT) systems (Witt and Young, 1997) that target children’s pronunciation disabilities. These systems serve both as assistive tools for diagnosis and as aids for practicing correct pronunciation (Witt and Young, 1997; Neri et al., 2002; Neri et al., 2008; Bunnell et al., 2000; Russell et al., 1996; Eskenazi and Hansma, 1998; Kawai and Hirose, 1997).
Unfortunately, many current technological tools remain limited. Despite progress in the field of speech processing, existing solutions have primarily concentrated on applying conventional automatic speech recognition (ASR) approaches to assess pronunciation (Witt and Young, 1997; Neri et al., 2002; Ehsani and Knodt, 1998). Conventional ASR approaches, however, face problems in this area: children’s speech is characterized by increased acoustic variability, while conventional ASR systems are trained on acoustic models derived from adult speech (Koenig, 2001). Moreover, some mispronunciation patterns occur more frequently in children than in adults (Benzeghiba et al., 2007). As a result, although automatic pronunciation systems hold promise, they are still not widely used, since their performance is perceived as less reliable (Ploog et al., 2013).
Other solutions have focused on explicitly modeling possible mispronunciations (Doremalen et al., 2013; Ronanki et al., 2012; Wei et al., 2009; Strik et al., 2009); however, the state-of-the-art system using this approach has the limitation that every target pronunciation must be learned separately against its specific “competing” pronunciations, i.e. mispronunciations (Doremalen et al., 2013). Developing such a system therefore requires substantial human supervision and input. In addition, the authors note that, compared to previous approaches, the system underperforms on low-frequency pronunciation events, since it must learn the specific features of every mispronunciation. The lack of a generalizable approach is a hurdle for automation, as is the need for a large corpus of training examples. An ideal system would incorporate knowledge of mispronunciations from a relatively small corpus while requiring minimal human supervision and input.
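The per-target supervision burden described above can be illustrated with a minimal sketch. Here each target pronunciation is paired with an explicitly enumerated set of competing mispronunciations; the word, phoneme sequences, and error patterns below are hypothetical illustrations, not drawn from the cited systems.

```python
# Minimal sketch of explicit mispronunciation modeling: every target
# pronunciation is listed with its own set of competing error variants,
# which is why such systems need per-target supervision and data.
# All lexicon entries below are hypothetical illustrations.
LEXICON = {
    "rabbit": {
        "correct": ("r", "ae", "b", "ih", "t"),
        "errors": [
            ("w", "ae", "b", "ih", "t"),   # gliding: /r/ -> /w/
            ("r", "ae", "b", "ih"),        # final-consonant deletion
        ],
    },
}

def classify(word, recognized):
    """Return 'correct', 'known_error', or 'unknown' for a phone sequence."""
    entry = LEXICON[word]
    if recognized == entry["correct"]:
        return "correct"
    if recognized in entry["errors"]:
        return "known_error"
    return "unknown"   # low-frequency events fall outside the learned set
```

A sequence such as `("l", "ae", "b", "ih", "t")` falls into the `"unknown"` category, illustrating why low-frequency pronunciation events are hard for systems that must enumerate every competing variant in advance.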
Our research goals are derived from these limitations. The short-term objective of our research is to develop a method that will constitute the core component of an effective pronunciation analysis system for children aged 4–12 with speech sound disorders, enabling them to receive accurate feedback on their speech production even in the absence of a clinician. The desired feedback addresses the question of whether a child pronounced a phoneme correctly. To be effective, the system is also designed to be highly automated. The long-term goal is to integrate such a system into remediation techniques, complementing current therapy strategies. In this work, we build upon existing methodologies and extend them. Our main contributions are (1) an explicit model of a-priori pronunciation errors for children in the target age range, and (2) explicit modeling of the acoustics of distorted phonemes. We begin by reviewing previous approaches to automated pronunciation assessment. Next, we introduce a database containing mispronunciations of children with speech sound disorders. We then describe our proposed approach and apply a variety of evaluation metrics to understand the strengths and weaknesses of the proposed methods. Finally, we present discussion, conclusions, and planned future work.
Section snippets
Review of Goodness of Pronunciation (GOP)
In this section, we explore in detail several different algorithmic approaches to computing the Goodness of Pronunciation (GOP) technique, followed by an examination of machine learning-based approaches that both contributed to the field and are pertinent to the proposed methods described in Section 4.1.
The GOP technique, originally defined by Witt and Young (1997), has been used for pronunciation assessment and has evolved throughout the years in order to improve the quality of the decision
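The core idea behind the GOP measure of Witt and Young (1997) is a duration-normalized log-likelihood ratio between the target phoneme and the best-scoring competing phoneme over the same frames. The sketch below illustrates that arithmetic only; the log-likelihood values are invented, and this is not the authors' implementation.

```python
def gop(ll_target, ll_best_phones, n_frames):
    """
    Goodness of Pronunciation in the spirit of Witt and Young (1997):
    the duration-normalized absolute log-ratio between the likelihood
    of the target phoneme and that of the best-scoring phoneme over
    the same frames. Inputs are total log-likelihoods over the segment
    (assumed to come from a forced alignment and a phone loop).
    """
    return abs(ll_target - ll_best_phones) / n_frames

# A well-pronounced phone: target likelihood close to the best competitor,
# yielding a small GOP score (accept).
good = gop(-120.0, -118.0, 40)
# A mispronounced phone: target likelihood far below the best competitor,
# yielding a large GOP score (reject).
bad = gop(-200.0, -118.0, 40)
```

In practice a phone is flagged as mispronounced when its GOP score exceeds a phone-specific threshold; much of the subsequent literature revises how the competing likelihood and the threshold are computed.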
Data collection
The data used in the experiment came predominantly from the Corpus of Children’s Pronunciation (CCP). To collect the data, we recruited 86 children aged 4–12. Co-occurrence of receptive and expressive language disorders is prevalent in children with speech production challenges, so these children were screened to ensure that they had the ability to complete the tasks required in the study. The diagnosis of a speech-sound disorder was conferred by a licensed,
GOP-CI
Similar to our previous work (Dudy et al., 2015), we introduce the following algorithmic approach, which aims to improve the baseline GOP measure described in Eq. (7) of the LGOP approach by (1) learning acoustic models from a large children’s speech database aged 3–15 and then adapting to the speech of children in the final target age range of 4–11, (2) incorporating an explicit model of correct and incorrect pronunciations of the corpus described in the previous section, and (3) explicit
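The decision logic suggested by the snippet above, comparing a forced "correct" path against a network that also admits the known incorrect variants, can be sketched as follows. This is a hypothetical simplification, not the paper's GOP-CI implementation; the threshold and score values are illustrative only.

```python
def gop_ci_decision(ll_num, ll_ci, n_frames, threshold=0.2):
    """
    Hypothetical sketch of a GOP-CI style decision: compare the forced
    correct-path score ("Num") against the best path through a network
    that also contains the known incorrect pronunciation variants ("CI").
    If admitting the error variants improves the per-frame score by more
    than a threshold, the phone is flagged as mispronounced. The default
    threshold here is illustrative, not a value from the paper.
    """
    per_frame_gain = (ll_ci - ll_num) / n_frames
    return "mispronounced" if per_frame_gain > threshold else "correct"
```

When the correct path already explains the audio well, the CI network offers little extra likelihood and the phone is accepted; a large gain means an error variant matched the acoustics better than the canonical phone.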
Raw results
In the first part of data analysis the raw results produced by the ASR are inspected. To visualize the recognition results by each of the elements of Eqs. (15)–(17) found in the GOP-CI and GOP-SVM approach, we extracted authentic examples from our data. In Fig. 9, there are four cases produced by our system. These cases show the segmentation of phonemes across the different elements; namely, the numerator (with the correct path) – “Num”, the correct-incorrect path “CI,” the Open Loop path “OL”,
Discussion and conclusions
In the current paper we proposed two automatic decision methods, GOP-CI and GOP-SVM, aimed at analyzing the speech of children who may have speech-sound disorders. To evaluate the proposed methods, the most recent state-of-the-art GOP variant, LGOP, served as the automatic reference method.
The first proposed method was the GOP-CI whose parameters were set using a grid search. The second proposed method was GOP-SVM which was trained using an SVM approach. Both methods learned
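The GOP-SVM idea of learning an accept/reject decision from GOP-style score features can be sketched with a minimal linear SVM trained by sub-gradient descent on the hinge loss. This stands in for the paper's method only in spirit: the features, labels, and hyperparameters below are synthetic, and the paper's actual feature set and SVM configuration are not reproduced here.

```python
import random

def train_linear_svm(X, y, epochs=200, lr=0.05, lam=0.01, seed=0):
    """
    Minimal linear SVM via stochastic sub-gradient descent on the
    regularized hinge loss. A stand-in sketch for learning a
    correct/mispronounced decision from GOP-derived score features.
    """
    rng = random.Random(seed)
    w, b = [0.0] * len(X[0]), 0.0
    idx = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(idx)
        for i in idx:
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            if margin < 1:   # inside margin: hinge-loss gradient step
                w = [wj + lr * (y[i] * xj - lam * wj)
                     for wj, xj in zip(w, X[i])]
                b += lr * y[i]
            else:            # correctly classified: regularization only
                w = [wj - lr * lam * wj for wj in w]
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Synthetic two-dimensional GOP-style features (values are invented):
# small scores indicate a correct pronunciation (+1), large scores a
# mispronunciation (-1).
X = [[0.1, 0.0], [0.2, 0.1], [1.5, 1.2], [1.8, 1.0]]
y = [1, 1, -1, -1]
w, b = train_linear_svm(X, y)
```

After training, `predict(w, b, features)` yields the accept/reject decision; in a real system the features would be the GOP-style scores extracted per phoneme and a kernel SVM could replace the linear one.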
Future work
To create more robust models, one of our future plans is to improve the current state of the art for annotation scores. This could be done by adding more human expert annotators and building a model that learns a pattern according to a voting system, reflecting several experts’ decisions. However, this should take into account the rate of disagreement among the annotators, since high levels of disagreement would produce an ineffective tool for
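The voting scheme sketched above, a majority vote over several annotators' judgments together with an agreement rate that could gate low-consensus items out of training, can be illustrated as follows (the label values and cutoff usage are hypothetical).

```python
from collections import Counter

def vote_with_agreement(labels):
    """
    Majority vote over several annotators' judgments of one phoneme,
    returned together with the agreement rate (fraction of annotators
    voting for the winning label). Items whose agreement falls below
    some cutoff could be excluded from model training.
    """
    counts = Counter(labels)
    label, n = counts.most_common(1)[0]
    return label, n / len(labels)

label, agreement = vote_with_agreement(["correct", "correct", "incorrect"])
# Two of three annotators agree: majority label "correct", agreement 2/3.
```

A cutoff on `agreement` (e.g. requiring more than two thirds consensus) would operationalize the concern about high disagreement rates among annotators.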
Acknowledgment
The presented research was supported by the National Institutes of Health (grant number R21DC012139).
References (42)
- Benzeghiba et al. Automatic speech recognition and speech variability: a review. Speech Commun. (2007)
- The decision tree classifier: design and potential. IEEE Trans. Geosci. Electron. (1977)
- Plosive/fricative distinction: the voiceless case. J. Acoust. Soc. Am. (1990)
- Allauzen et al. OpenFst: a general and efficient weighted finite-state transducer library. Proceedings of the Ninth International Conference on Implementation and Application of Automata (CIAA 2007) (2007)
- Bates et al. Fitting linear mixed-effects models using lme4. J. Stat. Softw. (2015)
- Bunnell et al. STAR: articulation training for young children. Proceedings of the 2000 INTERSPEECH Conference (2000)
- The likelihood ratio, Wald, and Lagrange multiplier tests: an expository note. Am. Stat. (1982)
- Castrogiovanni. Incidence and Prevalence of Communication Disorders and Hearing Loss in Children (2008)
- Cortes and Vapnik. Support vector machine. Mach. Learn. (1995)
- Doremalen et al. Automatic pronunciation error detection in non-native speech: the case of vowel errors in Dutch. J. Acoust. Soc. Am. (2013)
- Dudy et al. Pronunciation analysis for children with speech sound disorders. Proceedings of the IEEE Thirty-Seventh Annual International Conference of the Engineering in Medicine and Biology Society (EMBC) (2015)
- Ehsani and Knodt. Speech technology in computer-aided language learning: strengths and limitations of a new CALL paradigm. Lang. Learn. Technol. (1998)
- Eskenazi and Hansma. The FLUENCY pronunciation trainer. Proceedings of the 1998 STiLL Workshop (1998)
- Gierut. Treatment efficacy: functional phonological disorders in children. J. Speech Lang. Hear. Res. (1998)
- Goldman Fristoe Test of Articulation. American Guidance Service
- Phonological oppositions in children: a perceptual study. J. Acoust. Soc. Am.
- Huang et al. Hidden Markov Models for Speech Recognition (1990)
- Kawai and Hirose. A CALL system using speech recognition to train the pronunciation of Japanese long vowels, the mora nasal and mora obstruents. Proceedings of EUROSPEECH 1997, Rhodes, Greece (1997)
- Koenig. Distributional characteristics of VOT in children’s voiceless aspirated stops and interpretation of developmental trends. J. Speech Lang. Hear. Res. (2001)
- Applied Linear Statistical Models
- PLASER: pronunciation learning via automatic speech recognition. Proceedings of the HLT-NAACL Workshop on Building Educational Applications Using Natural Language Processing