Elsevier

Speech Communication

Volume 51, Issue 11, November 2009, Pages 1082-1097
Automated assessment of prosody production

https://doi.org/10.1016/j.specom.2009.04.007

Abstract

Assessment of prosody is important for diagnosis and remediation of speech and language disorders, for diagnosis of neurological conditions, and for foreign language instruction. Current assessment is largely auditory-perceptual, which has obvious drawbacks; however, automation of assessment faces numerous obstacles. We propose methods for automatically assessing production of lexical stress, focus, phrasing, pragmatic style, and vocal affect. Speech was analyzed from children in six tasks designed to elicit specific prosodic contrasts. The methods involve dynamic and global features, using spectral, fundamental frequency, and temporal information. The automatically computed scores were validated against mean scores from judges who, in all but one task, listened to “prosodic minimal pairs” of recordings, each pair containing two utterances from the same child with approximately the same phonemic material but differing on a specific prosodic dimension, such as stress. The judges identified the prosodic categories of the two utterances and rated the strength of their contrast. For almost all tasks, we found that the automated scores correlated with the mean scores approximately as well as the judges’ individual scores. Real-time scores assigned during examination – as is fairly typical in speech assessment – correlated substantially less than the automated scores with the mean scores.

Introduction

Assessment of prosody is important for diagnosis and remediation of speech and language disorders, for diagnosis of certain neurological conditions, as well as for foreign language instruction. This importance stems from the role prosody plays in speech intelligibility and comprehensibility (e.g., Wingfield et al., 1984, Silverman et al., 1993) and in social acceptance (e.g., McCann and Peppé, 2003, Peppé et al., 2006, Peppé et al., 2007), and from prosodic deficits in certain neurological conditions (e.g., stroke; House et al., 1987; or Parkinson’s Disease; Darley et al., 1969a, Darley et al., 1969b, Le Dorze et al., 1998).

Current assessment of speech, including that of prosody, is largely auditory-perceptual. As noted by Kent (1996; also see Kreiman and Gerratt, 1997), the reliability and validity of auditory-perceptual methods are often lower than desirable as a result of multiple factors, such as the difficulty of judging one aspect of speech without interference from other aspects (e.g., nasality judgments in the presence of varying degrees of hoarseness); the intrinsic multidimensional nature of certain judgment categories that require judges to weigh these dimensions (e.g., naturalness); the paucity of reference standards; and the difficulty of setting up truly “blind” judgment situations. Many of these issues are not specific to perceptual judgment of speech; in fact, there is an extensive body of literature on biases and inconsistencies in perceptual judgment going back several decades (e.g., Tversky, 1969).

Presumably, these issues would not be faced by automated (“instrumental”) speech assessment methods. Nevertheless, automated methods have largely been confined to analysis of voice features that are only marginally relevant for prosody (e.g., the Multi-Dimensional Voice Program™, or MDVP; Elemetrics, 1993). What obstacles are standing in the way of developing reliable automated prosody assessment methods?

An obstacle for any method, whether automated or auditory-perceptual, consists of the multiple “levels” of prosodic variability; for each level, one must distinguish between which deviations from some – generally ill-defined – norm are acceptable (e.g., due to speaking style) and which deviations are not (e.g., due to disease). One level of variability involves dialect. Work by Grabe et al., 2000, Grabe and Post, 2002, for example, has shown that prosodic differences between dialects of British English can be as large as prosodic differences between languages. Not surprisingly, dialect differences have been shown to create problems for auditory-perceptual assessment (e.g., assessment of speech naturalness; Mackey et al., 1997). Another level of variability involves differences between speakers that are not obviously due to dialect. These differences are sufficiently systematic to provide useful cues for speaker identification (e.g., Sönmez et al., 1998, Adami et al., 2003), and may involve several speaker characteristics, the most obvious of which are gender, age, and social class (e.g., Milroy and Milroy, 1978). At a third level, there is systematic within-speaker variability due to task demands (e.g., Hirschberg, 1995, Hirschberg, 2000), social context (e.g., Ofuka et al., 1994, Ofuka et al., 2000), and to emotional state (e.g., Scherer, 2003).

In addition to these forms of variability that are due to systematic factors, there is also variability that is apparently random. For example, in data reported by van Santen and Hirschberg (1994), in a highly confined task in which the speaker had to utter sentences of the type “Now I know 〈target word〉” in a prosodically consistent – and carefully monitored – manner, the initial boundary tone was found to have a range of 30 Hz while the final boundary tone had a range of only 3 Hz; the typical pitch range of these utterances was less than 100 Hz. This form of variability may be less of a challenge for auditory-perceptual methods because these methods may benefit from the human speech perception system’s ability to ignore communicatively irrelevant features of speech, but it clearly presents a challenge for automated methods.

There are additional aspects of prosody that pose complications and that are unrelated to variability. One is the intrinsically relative nature of many prosodic cues. For example, durational cues for the lexical stress status of a syllable are not in the form of some absolute duration but of how long or short the duration is compared to what can be expected based on the speaking rate, the segmental make-up of the syllable, and the location of the syllable in the word and phrase (van Santen, 1992, van Santen and Shih, 2000). A second aspect of prosody that poses complications for automated methods is that prosodic contrasts typically involve multiple acoustic features. To continue with the same example, lexical stress is expressed by a combination of duration, pitch, energy, spectral balance (e.g., Klatt, 1976, van Santen, 1992, Sluijter and van Heuven, 1996, van Santen and Niu, 2002, Miao et al., 2006), and additional features due to effects at the glottal level that are not fully captured by these basic acoustic features (e.g., glottal closing and opening slope; Marasek, 1996). Thus, there could be speaker-dependent trade-offs in terms of the relative strengths of these features, necessitating a fundamentally multidimensional approach to automated prosody assessment.
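The relativity of durational cues described above can be sketched as follows. This is a minimal illustration, not the authors' actual duration model: `expected_s` stands for a hypothetical context-dependent expected duration (from segmental make-up and position), and `rate_factor` for an assumed speaking-rate estimate.

```python
import math

def relative_duration(observed_s, expected_s, rate_factor=1.0):
    """Log-ratio of an observed syllable duration to its
    context-dependent expectation, after a speaking-rate correction.
    Positive values mean longer than expected (a potential stress
    cue); negative values mean shorter than expected."""
    return math.log((observed_s / rate_factor) / expected_s)

# A 180 ms syllable where 120 ms is expected reads as lengthened:
# relative_duration(0.18, 0.12) is log(1.5), about +0.405.
```

The log-ratio form makes lengthening and shortening symmetric around zero, which is convenient when such a cue is later combined with pitch and energy cues.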

Both the intrinsic relativity of individual prosodic features and the trade-offs between them pose challenges for automated prosody assessment methods. These challenges seem to be fundamentally different from those posed by, for example, vowel production assessment. In a given phonemic context, vowel formant frequencies must lie within fairly narrow ranges in order for the vowel to be perceived as intended. While prosodic categories cannot even remotely be characterized by “point templates” in some conventional acoustic space, point template approaches for phonemic categories used by typical speech recognition systems clearly work rather well, via vector-based acoustic models in conjunction with some initial normalization step (e.g., cepstral normalization, vocal tract length normalization) and making basic allowances for coarticulation (e.g., by using phonemic-context dependent acoustic models).

Despite these obstacles for automated methods, there are obvious drawbacks to relying on auditory-perceptual methods and important advantages to using automated methods. First, we already mentioned validity and reliability issues of auditory-perceptual methods. Second, given the poor access many individuals have to services from speech-language pathologists or foreign language teachers, reliance on computerized prosody remediation or instruction is likely to increase. To be truly useful, such computerized systems should have the capability to provide accurate feedback; this, in turn, requires accurate automated assessment. Third, despite the exquisite sensitivity of human hearing, it is plausible that diagnostically relevant acoustic markers exist whose detection exceeds human capabilities. Detection of some promising markers, such as statistical features of pause durations in the course of a 5-min speech recording (e.g., Roark et al., 2007), might be cognitively too demanding. Others could have too low an SNR to be humanly detectable. The acoustic feature of jitter, for example, has potential for early detection of certain vocal fold anomalies (e.g., Zhang and Jiang, 2008, Murry and Doherty, 1980) but has fairly high perceptual thresholds, certainly with non-stationary pitch (e.g., Cardozo and Ritsma, 1968). In other words, exclusive reliance on auditory-perceptual procedures is not good for discovery of new diagnostic markers.

We thus conclude that automated measures of assessment of prosody production are much needed, but that constructing such measures faces specific challenges. In our approach, we use a combination of the following design principles that help us address these challenges. (i) Highly constraining elicitation methods (e.g., repeating a particular word with a specific stress pattern) to reduce unwanted prosodic variability due to, for example, contextual effects on speaking style. (ii) A “prosodic minimal pairs” design for all but one task, in which the list of items used to elicit speech consists of randomized pairs that are identical except for the prosodic contrast (e.g., the third and seventh items on the list are both the nonsense word tauveeb, differing only in which syllable carries word stress). This serves to reduce the impact of confounding speaker characteristics, such as pitch range or vocal tract length; each speaker is his or her own control. (iii) Robust acoustic features that can handle, for example, mispronunciations and pitch tracking errors. (iv) Measures that consist of weighted combinations of multiple, maximally independent acoustic features, thereby allowing speakers to differ in the relative degrees to which they use these features. (v) Measures that include both global and dynamic features. Prosodic contrasts such as word stress are marked by pitch dynamics, while contrasts such as vocal affect can perhaps be characterized by global statistics. (vi) Parameter-poor (and even parameter-free) techniques in which the algorithms themselves either are based on established facts about prosody (e.g., the phrase-final lengthening phenomenon) or are developed in exploratory analyses of a separate data set whose characteristics are quite different from the main data in terms of speakers (e.g., adults and children ages 11–65 vs. children 4–7).
In conjunction with (ii) and (iii), this serves to maximize the portability of the measures in order to minimize the influences of recording conditions, SNR, sample characteristics, and other factors that may be difficult to control across laboratories or clinics. Parameter-rich systems may lack such portability, since the parameter estimates may depend on the idiosyncrasies of the acoustic recording conditions and the training samples.
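Principles (ii) and (iv) above can be sketched together in a few lines. This is a hypothetical illustration, not the paper's scoring formula: the feature vectors and weights are invented, and in practice the weights would come from the kind of fitting described later.

```python
import numpy as np

def minimal_pair_contrast(features_a, features_b, weights):
    """Score the strength of a prosodic contrast within one
    speaker's minimal pair. Differencing the two pair members'
    features (A minus B) cancels speaker-level offsets such as
    overall pitch level, so each speaker is his or her own control.
    The weighted sum lets speakers trade off cues (e.g., duration
    vs. pitch vs. spectral balance)."""
    diff = (np.asarray(features_a, dtype=float)
            - np.asarray(features_b, dtype=float))
    return float(np.dot(np.asarray(weights, dtype=float), diff))
```

A score near zero would indicate that the speaker failed to produce the intended contrast on any of the weighted cues.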

The goal of this paper is to describe the construction and validation of a number of prosody measures based on these design principles. The speech data were collected as part of an ongoing study of the production and interpretation of prosody in autism, whose aim is to detail prosodic difficulties in autism spectrum disorder, developmental language disorder, and typical development, in the age range of 4–8 years. The current paper focuses on methodology. Elsewhere we have presented preliminary findings on between-group differences on the suite of measures (Tucker Prud’hommeaux et al., 2008, van Santen et al., 2007, van Santen et al., 2008).

Section snippets

Speech elicitation methods

The tasks used for elicitation include variants and modifications of tasks in the PEPS-C (“Profiling Elements of Prosodic Systems – Children”; Peppé and McCann, 2003) paradigm, as well as of two tasks developed by Paul et al. (2005). In our study of prosody in autism, children complete tasks designed to test both their interpretation and their production of prosody. The present paper considers, in detail, the results of only the tasks related to the production of prosody. Findings are reviewed

Listening task results

For each task, the listener data can be represented as an N×6 table, where the N rows correspond to the utterances collected across all speakers, the 6 columns to the listeners, and the cells to the listener ratings. This per-utterance table can be condensed into a per-speaker table, by combining for each speaker the ratings of the k utterances as discussed in Section 2.5. We extend these tables by adding additional columns for mean listener scores (see below) and verified real-time scores (see
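The table condensation described above can be sketched as follows. The numbers are a toy example with 4 utterances × 3 listeners (the study used 6 listeners), and plain averaging over each speaker's utterances is an illustrative choice; the paper's actual combination rule is the one given in its Section 2.5.

```python
import numpy as np

# Per-utterance table: rows are utterances, columns are listeners.
ratings = np.array([[3, 4, 3],
                    [2, 2, 3],
                    [5, 4, 4],
                    [4, 5, 5]], dtype=float)
speaker_of_utterance = np.array([0, 0, 1, 1])

def per_speaker_table(ratings, speakers):
    """Condense the per-utterance table into a per-speaker table by
    averaging each listener's ratings over a speaker's utterances,
    then append a column of mean listener scores per speaker."""
    rows = []
    for s in np.unique(speakers):
        per_listener = ratings[speakers == s].mean(axis=0)
        rows.append(np.append(per_listener, per_listener.mean()))
    return np.array(rows)
```

The appended mean-listener column is the quantity the automated measures are later asked to predict.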

Pre-processing

Pre-processing consisted of the following steps: locating and extracting the child responses from the recordings; determining certain syllable, word, or phonetic segment boundaries (depending on the task and measure used; discussed in detail below); and extracting acoustic features, including: (i) fundamental frequency, or F0, using the Snack Sound Toolkit (2006) and (ii) amplitude (in dB) in four formant-range based passbands (B1(t): 60–1200 Hz, B2(t): 1200–3100 Hz, B3(t): 3100–4000 Hz, B4(t)
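The passband amplitude features above can be sketched for one analysis frame. Frame length, windowing, and the exact amplitude definition are assumptions here; the paper's pipeline is only summarized in this snippet.

```python
import numpy as np

def band_amplitude_db(frame, fs, lo_hz, hi_hz):
    """Amplitude (in dB) of one analysis frame within a passband,
    from the energy of the magnitude spectrum between lo_hz and
    hi_hz. One call per band per frame yields trajectories such as
    B1(t) for the 60-1200 Hz band."""
    spectrum = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    in_band = (freqs >= lo_hz) & (freqs < hi_hz)
    energy = np.sum(spectrum[in_band] ** 2) + 1e-12  # guard log(0)
    return 10.0 * np.log10(energy)
```

As a sanity check, a 300 Hz tone carries far more energy in the 60–1200 Hz band than in the 1200–3100 Hz band.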

Evaluation of automated measures

The key evaluation criterion is the degree to which the objective measures, separately and in combination, can predict the mean listener scores, as measured by the product-moment correlation. There are many ways in which, for a given task, the multiple acoustic measures can be combined, including neural nets, support vector machines, and other linear or non-linear methods. We have decided to use simple linear regression because of its robustness and its small number of parameters, which may
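The evaluation step described above can be sketched as follows. The toy data are invented; the paper reports correlations on its own tasks, and an in-sample correlation like this one will overstate performance unless cross-validated.

```python
import numpy as np

def combined_score_correlation(measures, mean_listener_scores):
    """Combine several acoustic measures by ordinary least squares
    (with intercept) and report the product-moment correlation
    between the combined score and the mean listener scores."""
    X = np.column_stack([np.ones(len(measures)),
                         np.asarray(measures, dtype=float)])
    y = np.asarray(mean_listener_scores, dtype=float)
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    predicted = X @ coefs
    # Pearson (product-moment) correlation of prediction vs. target.
    return float(np.corrcoef(predicted, y)[0, 1])
```

Linear regression keeps the parameter count low, which is in keeping with the parameter-poor design principle stated in the introduction.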

Conclusions

The automated methods proposed in this paper succeeded in approximating human ratings in reliability, as assessed via correlations with auditory-perceptual mean listener ratings. In addition, the objective measures were superior to the conventional method of assessment in which the examiner makes real-time judgments and verifies these off-line. These automated methods could be of immediate practical value in terms of substantial labor savings and enhanced reliability.

More important, however,

Disclosure

This manuscript has not been published in whole or substantial part by another publisher and is not currently under review by another journal.

Acknowledgements

We thank Sue Peppé for making available the pictorial stimuli and for granting permission to us for creating new versions of several PEPS-C tasks; Lawrence Shriberg for helpful comments on an earlier draft of the paper, in particular on the LSR section; Rhea Paul for helpful comments on an earlier draft of the paper and for suggesting the Lexical Stress and Pragmatic Style Tasks; the clinical staff at CSLU (Beth Langhorst, Rachel Coulston, and Robbyn Sanger Hahn) and at Yale University (Nancy

References (65)

  • R.E. Barlow et al.

    Statistical Inference Under Order Restrictions

    (1972)
  • S. Berument et al.

Autism screening questionnaire: diagnostic validity

    Brit. J. Psychiatr.

    (1999)
  • B. Cardozo et al.

    On the perception of imperfect periodicity

    IEEE Trans. Audio Electroacoust.

    (1968)
  • J. Cohen

A coefficient of agreement for nominal scales

Educ. Psychol. Measur.

    (1960)
  • F.L. Darley et al.

    Differential diagnostic patterns of dysarthria

    J. Speech Hear. Res.

    (1969)
  • F.L. Darley et al.

    Clusters of deviant speech dimensions in the dysarthrias

    J. Speech Hear. Res.

    (1969)
  • C. Dollaghan et al.

    Nonword repetition and child language disorder

    J. Speech Lang. Hear. Res.

    (1998)
  • DSM-IV-TR, 2002. Diagnostic and Statistical Manual of Mental Disorders. American Psychiatric...
  • P. Ekman et al.

    Pictures of Facial Affect

    (1976)
  • Elemetrics, 1993. Multi-Dimensional Voice Program (MDVP). [Computer program.] Pine Brook,...
  • Entropic Research Laboratory Inc., 1996. ESPS Programs...
  • C. Gillberg et al.

    High functioning people with autism and Asperger syndrome

  • K. Gotham et al.

    The Autism Diagnostic Observation Schedule (ADOS): revised algorithms for improved diagnostic validity

    J. Autism Dev. Disord.

    (2007)
  • Grabe, E., Post, B., 2002. Intonational variation in English. In: Proc. Speech Prosody 2002 Conference, 11–13 April...
  • Hirschberg, J., 1995. Prosodic and other acoustic cues to speaking style in spontaneous and read speech. In: Proc....
  • J. Hirschberg

    A corpus-based approach to the study of speaking style

  • A. House et al.

    Affective prosody in the reading voice of stroke patients

    J. Neurol. Neurosurg. Psychiatr.

    (1987)
  • R.D. Kent

    Hearing and believing: some limits to the auditory-perceptual assessment of speech and voice disorders

    Amer. J. Speech-Lang. Pathol.

    (1996)
  • Klabbers, E., Mishra, T., van Santen, J., 2007. Analysis of affective speech recordings using the superpositional...
  • D. Klatt

    Linguistic uses of segmental duration in English: acoustic and perceptual evidence

    J. Acoust. Soc. Amer.

    (1976)
  • G. Le Dorze et al.

    A comparison of the prosodic characteristics of the speech of people with Parkinson’s disease and Friedreich’s ataxia with neurologically normal speakers

    Folia Phoniatr. Logop.

    (1998)
  • S. Lee et al.

    Acoustics of children’s speech: developmental changes of temporal and spectral parameters

    J. Acoust. Soc. Amer.

    (1999)

    Preliminary results of this work were presented as posters at IMFAR 2007 and IMFAR 2008.