Computer Speech & Language

Volume 50, July 2018, Pages 62-84

Automatic analysis of pronunciations for children with speech sound disorders

https://doi.org/10.1016/j.csl.2017.12.006

Highlights

  • Propose two approaches for evaluating pronunciations for children with speech sound disorders.

  • GOP-Algorithm – an algorithm-based approach.

  • GOP-SVM – a statistical-learning-based approach.

  • Show that the proposed approaches outperform the current state-of-the-art GOP approach using several different tests.

  • Describe the CCP corpus, which contains correct and incorrect pronunciations that were essential for training and testing the proposed methods.

Abstract

Computer-Assisted Pronunciation Training (CAPT) systems aim to help a child learn the correct pronunciations of words. However, while there are many online commercial CAPT apps, there is no consensus among Speech Language Therapists (SLPs) or non-professionals about which CAPT systems, if any, work well. The prevailing assumption is that practicing with such programs is less reliable and thus does not provide the feedback necessary to allow children to improve their performance. The most common method for assessing pronunciation performance is the Goodness of Pronunciation (GOP) technique. Our paper proposes two new GOP techniques. We have found that pronunciation models that use explicit knowledge about error pronunciation patterns can lead to more accurate classification of whether a phoneme was pronounced correctly. We evaluate the proposed pronunciation assessment methods against a baseline state-of-the-art GOP approach and show that the proposed techniques lead to classification performance that is more similar to that of a human expert.

Introduction

Phonological disorders are among the most prevalent communicative disabilities diagnosed in preschool and school-age children, accounting for 10% of this population (Gierut, 1998). The American Speech-Language-Hearing Association determined that there is an observed relationship between early phonological disorders and subsequent reading, writing, spelling, and mathematical abilities (Castrogiovanni, 2008). Furthermore, speech production difficulties affect not only children’s communication and academic performance, but also their level of interaction with peers and adults. Considering the limited availability of speech-language pathologists (SLPs) (Squires, 2013), it is likely that a parent whose child was diagnosed with a phonological disorder would prefer to have their child practice and acquire language skills as quickly as possible, rather than relying solely on the limited time they can spend with an SLP.

Technology has provided one plausible solution to address the need to improve language skills. Advances in speech recognition technology in the early 1990s (Huang et al., 1990) have been applied to the lack of professional human resources to train children, specifically by developing computer-assisted pronunciation training (CAPT) systems (Witt and Young, 1997) that address children’s pronunciation disabilities. These pronunciation systems serve both as assistive tools for diagnosis and for practicing correct pronunciations (Witt and Young, 1997; Neri et al., 2002; Neri et al., 2008; Bunnell et al., 2000; Russell et al., 1996; Eskenazi and Hansma, 1998; Kawai and Hirose, 1997).

Unfortunately, many of the current technological tools are still limited. Despite progress in the field of speech processing, existing solutions have primarily concentrated on applying conventional automatic speech recognition (ASR) approaches to assess pronunciation in speech (Witt and Young, 1997; Neri et al., 2002; Ehsani and Knodt, 1998). However, conventional ASR approaches face problems in this area, as the speech of children is characterized by increased acoustic variability, while conventional ASR systems are trained to generate acoustic models from adult speech (Koenig, 2001). Moreover, some mispronunciation patterns may occur in children more frequently than in adults (Benzeghiba et al., 2007). As a result, while these automatic pronunciation systems hold promise, they are still not widely used, since these technologies appear to be less reliable in terms of their performance (Ploog et al., 2013).

Other solutions have focused on explicitly modeling possible mispronunciations (Doremalen et al., 2013; Ronanki et al., 2012; Wei et al., 2009; Strik et al., 2009); however, the state-of-the-art system using this approach has the limitation that every target pronunciation must be learned separately against its specific “competing” pronunciations, i.e., mispronunciations (Doremalen et al., 2013). Therefore, the process of developing such a system requires substantial human supervision and input. In addition, the authors mention that, in comparison to previous approaches, the system underperforms when presented with low-frequency pronunciation events, as it requires learning the specific features of every mispronunciation. The lack of a generalizable approach is a hurdle for automation, as is the need for a large corpus of training examples. An ideal system would incorporate knowledge of mispronunciations from a relatively small corpus while simultaneously requiring minimal human supervision and input.

Our research goals are derived from these limitations. The short-term objective of our research is to develop a method that will constitute the core component of an effective pronunciation analysis system for children aged 4–12 with speech sound disorders, enabling them to receive accurate feedback on their speech production, even in the absence of a clinician. The desired feedback addresses the question of whether a child correctly pronounced a phoneme or not. In addition, to be effective, our system is designed to be highly automated. The long-term goal is to have such a system integrated into remediation techniques, complementing current therapy strategies. In this work, we build upon existing methodologies and extend them. Our main contributions are (1) developing an explicit model of a-priori pronunciation errors for children in the target age range, and (2) explicitly modeling the acoustics of distorted phonemes. We begin by investigating previous approaches in the field of automated pronunciation assessment. Next, we introduce a database containing mispronunciations of children with speech sound disorders. We then describe our proposed approaches and apply a variety of evaluation metrics to understand their strengths and weaknesses. Finally, we present a discussion, conclusions, and the future work planned to further improve the current research.

Section snippets

Review of Goodness of Pronunciation (GOP)

In this section, we explore in detail several different algorithmic approaches to computing the Goodness of Pronunciation (GOP) measure, followed by an examination of machine learning-based approaches that both contributed to the field and are pertinent to the proposed methods described in Section 4.1.

The GOP technique, originally defined by Witt and Young (1997), has been used for pronunciation assessment and has evolved throughout the years in order to improve the quality of the decision
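As a point of reference, the classical Witt and Young formulation scores each phoneme segment with a duration-normalized log-likelihood ratio between the forced alignment of the expected phoneme and the best unconstrained (phone-loop) path, and then thresholds that score. Below is a minimal sketch of that computation only, not of the LGOP baseline or of the methods proposed here; the function names, inputs, and threshold handling are illustrative.

    def gop_score(forced_loglik, phone_loop_loglik, n_frames):
        """Classical GOP (Witt and Young, 1997): duration-normalized
        log-likelihood ratio between the forced alignment of the expected
        phoneme and the best unconstrained phone-loop path over the same
        acoustic segment."""
        return abs(forced_loglik - phone_loop_loglik) / max(n_frames, 1)

    def is_mispronounced(forced_loglik, phone_loop_loglik, n_frames, threshold):
        """A phoneme is flagged as mispronounced when its GOP score exceeds a
        (typically phoneme-specific) threshold."""
        return gop_score(forced_loglik, phone_loop_loglik, n_frames) > threshold

    # Illustrative usage with made-up segment log-likelihoods over 42 frames:
    print(is_mispronounced(-310.0, -255.0, 42, threshold=1.0))  # True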

Data collection

The data predominantly used in the experiments came from the Corpus of Children’s Pronunciation (CCP). In order to collect the data, we recruited 86 children aged 4–12 (μ=5.3, σ=1.3). Co-occurrence of receptive and expressive language disorders is prevalent in children with speech production challenges, so these children were screened to ensure that they had the ability to complete the tasks required in the study. The diagnosis of a speech-sound disorder was conferred by a licensed,

GOP-CI

Similar to our previous work (Dudy et al., 2015), we introduce the following algorithmic approach, which aims to improve the baseline GOP measure described in Eq. (7) of the LGOP approach by (1) learning acoustic models from a large database of children’s speech (ages 3–15) and then adapting them to the speech of children in the final target age range of 4–11, (2) incorporating an explicit model of correct and incorrect pronunciations from the corpus described in the previous section, and (3) explicit
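To make the idea of an explicit correct/incorrect pronunciation model concrete, the sketch below encodes known error patterns as alternative lexicon entries and scores an utterance against both the correct path and the best-scoring incorrect path. The example word, the phoneme symbols, the error patterns, and the align_and_score interface are illustrative assumptions, not the lexicon or decoder actually used in this work.

    # Hypothetical lexicon fragment: each target word maps to its canonical
    # pronunciation plus known error patterns (e.g., drawn from a corpus such
    # as the CCP). All symbols and entries here are illustrative only.
    LEXICON = {
        "rabbit": {
            "correct":   ["r", "ae", "b", "ih", "t"],
            "incorrect": [["w", "ae", "b", "ih", "t"],   # gliding: /r/ -> /w/
                          ["r", "ae", "b", "ih"]],       # final-consonant deletion
        },
    }

    def score_paths(features, word, align_and_score):
        """Score the utterance against the correct path and every known
        incorrect path; align_and_score stands in for a forced-alignment
        scorer returning a per-frame-normalized log-likelihood."""
        entry = LEXICON[word]
        correct = align_and_score(features, entry["correct"])
        best_incorrect = max(align_and_score(features, p) for p in entry["incorrect"])
        return correct, best_incorrect

    # Toy usage with a stub scorer; a real system would use forced alignment.
    stub_scorer = lambda feats, phones: -float(len(phones))
    print(score_paths(None, "rabbit", stub_scorer))  # (-5.0, -4.0)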

Raw results

In the first part of the data analysis, the raw results produced by the ASR are inspected. To visualize the recognition results for each of the elements of Eqs. (15)–(17) found in the GOP-CI and GOP-SVM approaches, we extracted authentic examples from our data. In Fig. 9, there are four cases produced by our system. These cases show the segmentation of phonemes across the different elements; namely, the numerator (with the correct path) – “Num”, the correct-incorrect path “CI,” the Open Loop path “OL”,

Discussion and conclusions

In the current paper we proposed two automatic decision methods, GOP-CI and GOP-SVM, aimed at analyzing the speech of children who may be facing speech-sound disorders. To evaluate the proposed methods, the most recent state-of-the-art GOP measure, LGOP, served as the automatic method of reference.

The first proposed method was GOP-CI, whose parameters were set using a grid search. The second proposed method was GOP-SVM, which was trained using an SVM approach. Both methods learned
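As an illustration of the statistical-learning route, the sketch below trains an SVM on GOP-style features using scikit-learn. The feature layout, the synthetic data, and the hyperparameters are placeholders under stated assumptions; they are not the feature set or configuration used for GOP-SVM in this paper.

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # Each row describes one phoneme segment with GOP-style features, e.g.
    # [score vs. the correct path, score vs. the best incorrect path, duration];
    # labels: 1 = pronounced correctly, 0 = mispronounced (expert annotation).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))            # synthetic stand-in for real features
    y = (X[:, 0] > X[:, 1]).astype(int)      # synthetic stand-in for expert labels

    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
    clf.fit(X, y)
    print(clf.score(X, y))                   # in practice, evaluate on held-out data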

Future work

In order to create more robust models, one of our future plans is to improve the current state-of-the-art for annotation scores. This could be done by adding an increased number of human expert annotators, producing a model that would learn a pattern according to a voting system and be able to reflect several experts’ decisions. However, this should take into account the rate of disagreement among the annotators, since high levels of disagreement would produce an ineffective tool for
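A minimal sketch of such a voting scheme, under assumptions of our own, is given below: it takes per-phoneme labels from several annotators, produces a majority-vote label, and tracks the agreement rate so that items with high disagreement can be flagged or excluded. The data layout and the agreement threshold are illustrative, not part of the current system.

    from collections import Counter

    def vote_and_agreement(annotations):
        """annotations: labels (e.g., "correct"/"incorrect") for one phoneme
        token, one per annotator. Returns the majority label and the fraction
        of annotators agreeing with it."""
        counts = Counter(annotations)
        label, n = counts.most_common(1)[0]
        return label, n / len(annotations)

    # Illustrative usage: flag items where the experts disagree too much.
    MIN_AGREEMENT = 0.75                       # assumed threshold
    label, agreement = vote_and_agreement(["correct", "correct", "incorrect", "correct"])
    if agreement < MIN_AGREEMENT:
        print("low agreement; exclude or re-annotate this item")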

Acknowledgment

The presented research was supported by the National Institutes of Health (grant number R21DC012139).

References (42)

  • M. Benzeghiba et al.

    Automatic speech recognition and speech variability: a review

    Speech Commun.

    (2007)
  • P. Swain et al.

    The decision tree classifier: design and potential

    IEEE Trans. Geosci. Electron.

    (1977)
  • L. Weigelt et al.

    Plosive/fricative distinction: the voiceless case

    J. Acoust. Soc. Am.

    (1990)
  • C. Allauzen et al.

    OpenFst: a general and efficient weighted finite-state transducer library

    Proceedings of the Ninth International Conference on Implementation and Application of Automata (CIAA 2007)

    (2007)
  • D. Bates et al.

    Fitting linear mixed-effects models using lme4

    J. Stat. Softw.

    (2015)
  • T. Bunnell et al.

    STAR: articulation training for young children

    Proceedings of the 2000 INTERSPEECH Conference

    (2000)
  • A. Buse

The likelihood ratio, Wald, and Lagrange multiplier tests: an expository note

    Am. Stat.

    (1982)
  • A. Castrogiovanni

    Incidence and Prevalence of Communication Disorders and Hearing Loss in Children

    (2008)
  • C. Cortes et al.

    Support vector machine

Mach. Learn.

    (1995)
  • V.J. Doremalen et al.

    Automatic pronunciation error detection in non-native speech: the case of vowel errors in Dutch

    J. Acoust. Soc. Am.

    (2013)
  • S. Dudy et al.

    Pronunciation analysis for children with speech sound disorders

    Proceedings of the IEEE Thirty-Seventh Annual International Conference of the Engineering in Medicine and Biology Society (EMBC, 2015)

    (2015)
  • F. Ehsani et al.

    Speech technology in computer-aided language learning: strengths and limitations of a new CALL paradigm

    Lang. Learn. Technol.

    (1998)
  • M. Eskenazi et al.

    The FLUENCY pronunciation trainer

    Proceedings of the 1998 STiLL Workshop

    (1998)
  • J. Gierut

    Treatment efficacy: functional phonological disorders in children

    J. Speech Lang. Hear. Res.

    (1998)
  • R. Goldman et al.

Goldman-Fristoe Test of Articulation

    American Guidance Service

    (1986)
  • L. Graham et al.

    Phonological oppositions in children: a perceptual study

    J. Acoust. Soc. Am.

    (1971)
X. Huang et al.

    Hidden Markov Models for Speech Recognition

    (1990)
  • G. Kawai et al.

    A CALL system using speech recognition to train the pronunciation of Japanese long vowels, the mora nasal and mora obstruents

Proceedings of the 1997 EUROSPEECH, Rhodes, Greece

    (September 22–25, 1997)
  • L. Koenig

    Distributional characteristics of VOT in children’s voiceless aspirated stops and interpretation of developmental trends

    J. Speech Lang. Hear. Res.

    (2001)
  • M. Kutner et al.

    Applied Linear Statistical Models

    (2005)
  • B. Mak et al.

    PLASER: pronunciation learning via automatic speech recognition

    Proceedings of the HLT-NAACL Workshop on Building Educational Applications using Natural Language Processing

    (2003)