Automated assessment of prosody production☆
Introduction
Assessment of prosody is important for diagnosis and remediation of speech and language disorders, for diagnosis of certain neurological conditions, as well as for foreign language instruction. This importance stems from the role prosody plays in speech intelligibility and comprehensibility (e.g., Wingfield et al., 1984, Silverman et al., 1993) and in social acceptance (e.g., McCann and Peppé, 2003, Peppé et al., 2006, Peppé et al., 2007), and from prosodic deficits in certain neurological conditions (e.g., stroke; House et al., 1987; or Parkinson’s Disease; Darley et al., 1969a, Darley et al., 1969b, Le Dorze et al., 1998).
Current assessment of speech, including that of prosody, is largely auditory-perceptual. As noted by Kent (1996; also see Kreiman and Gerratt, 1997), the reliability and validity of auditory-perceptual methods is often lower than desirable as the result of multiple factors, such as the difficulty of judging one aspect of speech without interference from other aspects (e.g., nasality judgments in the presence of varying degrees of hoarseness); the intrinsic multidimensional nature of certain judgment categories that require judges to weigh these dimensions (e.g., naturalness); the paucity of reference standards; and the difficulty of setting up truly “blind” judgment situations. Many of these issues are not specific to perceptual judgment of speech; in fact, there is an extensive body of literature on biases and inconsistencies in perceptual judgment going back several decades (e.g., Tversky, 1969).
Presumably, these issues would not be faced by automated (“instrumental”) speech assessment methods. Nevertheless, automated methods have largely been confined to analysis of voice features that are only marginally relevant for prosody (e.g., the Multi-Dimensional Voice Program™, or MDVP; Kay Elemetrics, 1993). What obstacles are standing in the way of developing reliable automated prosody assessment methods?
An obstacle for any method, whether automated or auditory-perceptual, consists of the multiple “levels” of prosodic variability; for each level, one must distinguish which deviations from some – generally ill-defined – norm are acceptable (e.g., due to speaking style) and which are not (e.g., due to disease). One level of variability involves dialect. Work by Grabe et al., 2000, Grabe and Post, 2002, for example, has shown that prosodic differences between dialects of British English can be as large as prosodic differences between languages. Not surprisingly, dialect differences have been shown to create problems for auditory-perceptual assessment (e.g., assessment of speech naturalness; Mackey et al., 1997). Another level of variability involves differences between speakers that are not obviously due to dialect. These differences are sufficiently systematic to provide useful cues for speaker identification (e.g., Sönmez et al., 1998, Adami et al., 2003), and may involve several speaker characteristics, the most obvious of which are gender, age, and social class (e.g., Milroy and Milroy, 1978). At a third level, there is systematic within-speaker variability due to task demands (e.g., Hirschberg, 1995, Hirschberg, 2000), social context (e.g., Ofuka et al., 1994, Ofuka et al., 2000), and emotional state (e.g., Scherer, 2003).
In addition to these forms of variability that are due to systematic factors, there is also variability that is apparently random. For example, in data reported by van Santen and Hirschberg (1994), in a highly confined task in which the speaker had to utter sentences of the type “Now I know 〈target word〉” in a prosodically consistent – and carefully monitored – manner, the initial boundary tone was found to have a range of 30 Hz while the final boundary tone had a range of only 3 Hz; the typical pitch range of these utterances was less than 100 Hz. This form of variability may be less of a challenge for auditory-perceptual methods because these methods may benefit from the human speech perception system’s ability to ignore communicatively irrelevant features of speech, but it clearly presents a challenge for automated methods.
There are additional aspects of prosody that pose complications and that are unrelated to variability. One is the intrinsically relative nature of many prosodic cues. For example, durational cues for the lexical stress status of a syllable are not in the form of some absolute duration but of how long or short the duration is compared to what can be expected based on the speaking rate, the segmental make-up of the syllable, and the location of the syllable in the word and phrase (van Santen, 1992, van Santen and Shih, 2000). A second aspect of prosody that poses complications for automated methods is that prosodic contrasts typically involve multiple acoustic features. To continue with the same example, lexical stress is expressed by a combination of duration, pitch, energy, spectral balance (e.g., Klatt, 1976, van Santen, 1992, Sluijter and van Heuven, 1996, van Santen and Niu, 2002, Miao et al., 2006), and additional features due to effects at the glottal level that are not fully captured by these basic acoustic features (e.g., glottal closing and opening slope; Marasek, 1996). Thus, there could be speaker-dependent trade-offs in terms of the relative strengths of these features, necessitating a fundamentally multidimensional approach to automated prosody assessment.
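To make the relativity point concrete, the sketch below expresses a syllable duration as a z-score against a context-dependent expectation. The expectation and spread values are invented for illustration; a real expectation model would condition on speaking rate, segmental make-up, and position, as the text describes:

```python
def relative_duration(observed_ms, expected_ms, expected_sd_ms):
    """Express an observed syllable duration relative to its
    context-dependent expectation, as a z-score. The expectation
    model itself is hypothetical here, standing in for one that
    accounts for speaking rate, segmental make-up, and position."""
    return (observed_ms - expected_ms) / expected_sd_ms

# The same absolute duration (250 ms) is unremarkable in one context
# but markedly long in another (illustrative numbers):
z_typical = relative_duration(250, 250, 40)  # as expected
z_long = relative_duration(250, 150, 40)     # long: consistent with stress
```

The point is that no absolute duration threshold separates stressed from unstressed syllables; only the deviation from a contextual expectation carries the cue.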
Both the intrinsic relativity of individual prosodic features and the trade-offs between them pose challenges for automated prosody assessment methods. These challenges seem to be fundamentally different from those posed by, for example, vowel production assessment. In a given phonemic context, vowel formant frequencies must lie within fairly narrow ranges in order for the vowel to be perceived as intended. While prosodic categories cannot even remotely be characterized by “point templates” in some conventional acoustic space, point template approaches for phonemic categories used by typical speech recognition systems clearly work rather well, via vector-based acoustic models in conjunction with some initial normalization step (e.g., cepstral normalization, vocal tract length normalization) and making basic allowances for coarticulation (e.g., by using phonemic-context dependent acoustic models).
Despite these obstacles for automated methods, there are obvious drawbacks to relying on auditory-perceptual methods and important advantages to using automated methods. First, we already mentioned validity and reliability issues of auditory-perceptual methods. Second, given the poor access many individuals have to services from speech-language pathologists or foreign language teachers, reliance on computerized prosody remediation or instruction is likely to increase. To be truly useful, such computerized systems should have the capability to provide accurate feedback; this, in turn, requires accurate automated assessment. Third, despite the exquisite sensitivity of human hearing, it is plausible that diagnostically relevant acoustic markers exist whose detection exceeds human capabilities. Detection of some promising markers, such as statistical features of pause durations in the course of a 5-min speech recording (e.g., Roark et al., 2007), might be cognitively too demanding. Others could have too low an SNR to be humanly detectable. The acoustic feature of jitter, for example, has potential for early detection of certain vocal fold anomalies (e.g., Zhang and Jiang, 2008, Murry and Doherty, 1980) but has fairly high perceptual thresholds, certainly with non-stationary pitch (e.g., Cardozo and Ritsma, 1968). In other words, exclusive reliance on auditory-perceptual procedures is likely to impede the discovery of new diagnostic markers.
We thus conclude that automated measures of assessment of prosody production are much needed, but that constructing such measures faces specific challenges. In our approach, we use a combination of the following design principles that help us address these challenges. (i) Highly constraining elicitation methods (e.g., repeating a particular word with a specific stress pattern) to reduce unwanted prosodic variability due to, for example, contextual effects on speaking style. (ii) A “prosodic minimal pairs” design for all but one task, in which the list of items used to elicit speech consists of randomized pairs that are identical except for the prosodic contrast (e.g., the third and seventh items on the list are both the nonsense word tauveeb, differing only in which syllable is stressed). This serves to reduce the impact of confounding speaker characteristics, such as pitch range or vocal tract length; each speaker is his or her own control. (iii) Robust acoustic features that can handle, for example, mispronunciations and pitch tracking errors. (iv) Measures that consist of weighted combinations of multiple, maximally independent acoustic features, thereby allowing speakers to differ in the relative degrees to which they use these features. (v) Measures that include both global and dynamic features. Prosodic contrasts such as word stress are marked by pitch dynamics, while contrasts such as vocal affect can perhaps be characterized by global statistics. (vi) Parameter-poor (and even parameter-free) techniques in which the algorithms themselves either are based on established facts about prosody (e.g., the phrase-final lengthening phenomenon) or are developed in exploratory analyses of a separate data set whose characteristics are quite different from the main data in terms of speakers (e.g., adults and children ages 11–65 vs. children 4–7).
In conjunction with (ii) and (iii), principle (vi) serves to maximize the portability of the measures in order to minimize the influences of recording conditions, SNR, sample characteristics, and other factors that may be difficult to control across laboratories or clinics. Parameter-rich systems may lack such portability, since the parameter estimates may depend on the idiosyncrasies of the acoustic recording conditions and the training samples.
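Principles (ii) and (iv) can be illustrated with a minimal sketch. The features, weights, and numbers below are purely hypothetical, not taken from the study; the point is that a weighted combination of within-speaker minimal-pair differences can score a contrast as correctly produced regardless of which individual cue a given speaker relies on:

```python
import numpy as np

# Hypothetical per-speaker feature differences between the two members of
# a prosodic minimal pair (stressed minus unstressed token of the same
# word). Columns: duration ratio, pitch peak difference (semitones),
# spectral balance difference (dB). Each speaker is their own control.
pair_diffs = np.array([
    [0.35, 2.1, 1.8],   # speaker relying mostly on duration and pitch
    [0.10, 3.5, 0.4],   # speaker relying mostly on pitch
    [0.40, 0.8, 2.6],   # speaker relying on duration and spectral balance
])

# Weights would be developed on a separate data set (cf. principle vi);
# these values are invented for illustration.
weights = np.array([1.0, 0.5, 0.6])

# A positive combined score indicates the contrast was produced in the
# expected direction, whichever cue (or cue mix) carried it.
scores = pair_diffs @ weights
```

All three hypothetical speakers receive clearly positive scores even though no single feature is large for all of them, which is the motivation for a multidimensional measure.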
The goal of this paper is to describe the construction and validation of a number of prosody measures based on these design principles. The speech data were collected as part of an ongoing study of the production and interpretation of prosody in autism, whose aim is to detail prosodic difficulties in autism spectrum disorder, developmental language disorder, and typical development, in the age range of 4–8 years. The current paper focuses on methodology. Elsewhere we have presented preliminary findings on between-group differences on the suite of measures (Tucker Prud’hommeaux et al., 2008, van Santen et al., 2007, van Santen et al., 2008).
Section snippets
Speech elicitation methods
The tasks used for elicitation include variants and modifications of tasks in the PEPS-C (“Profiling Elements of Prosodic Systems – Children”; Peppé and McCann, 2003) paradigm, as well as of two tasks developed by Paul et al. (2005). In our study of prosody in autism, children complete tasks designed to test both their interpretation and their production of prosody. The present paper considers, in detail, the results of only the tasks related to the production of prosody. Findings are reviewed
Listening task results
For each task, the listener data can be represented as an N × 6 table, where the N rows correspond to the utterances collected across all speakers, the 6 columns to the listeners, and the cells to the listener ratings. This per-utterance table can be condensed into a per-speaker table, by combining for each speaker the ratings of the k utterances as discussed in Section 2.5. We extend these tables by adding additional columns for mean listener scores (see below) and verified real-time scores (see
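The tabulation step can be sketched as follows. The binary ratings, the value of k, and simple averaging as the combining rule are all placeholder assumptions for illustration; the paper's actual combining rule is the one described in its Section 2.5:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-utterance table: N utterances (rows) x 6 listeners
# (columns), with illustrative binary ratings in the cells.
n_utterances, n_listeners = 12, 6
ratings = rng.integers(0, 2, size=(n_utterances, n_listeners))

# Each speaker contributed k utterances; map utterance rows to speakers.
k = 4
speaker_ids = np.repeat(np.arange(n_utterances // k), k)

# Condense into a per-speaker table by combining each speaker's k rows
# (here: simple averaging), then derive a mean-listener-score column.
per_speaker = np.array([ratings[speaker_ids == s].mean(axis=0)
                        for s in np.unique(speaker_ids)])
mean_listener_score = per_speaker.mean(axis=1)
```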
Pre-processing
Pre-processing consisted of the following steps: locating and extracting the child responses from the recordings; determining certain syllable, word, or phonetic segment boundaries (depending on the task and measure used; discussed in detail below); and extracting acoustic features, including: (i) fundamental frequency, or F0, using the Snack Sound Toolkit (2006) and (ii) amplitude (in dB) in four formant-range based passbands (60–1200 Hz, 1200–3100 Hz, 3100–4000 Hz,
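A passband amplitude of the kind listed in (ii) can be sketched with a simple spectrum-based computation. This is a stand-in for the paper's filterbank, not its actual implementation; the sampling rate and the test tone are invented for illustration:

```python
import numpy as np

def band_db(signal, fs, lo, hi):
    """Amplitude (in dB) of `signal` within the passband [lo, hi) Hz,
    computed from the magnitude spectrum. Illustrative stand-in for a
    proper filterbank implementation."""
    spec = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    band = (freqs >= lo) & (freqs < hi)
    energy = np.sum(spec[band] ** 2)
    return 10.0 * np.log10(energy + 1e-12)  # guard against log(0)

fs = 16000
t = np.arange(fs) / fs
# A 500 Hz tone should carry far more energy in the 60-1200 Hz band
# than in the 1200-3100 Hz band.
tone = np.sin(2 * np.pi * 500 * t)
low_band = band_db(tone, fs, 60, 1200)
mid_band = band_db(tone, fs, 1200, 3100)
```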
Evaluation of automated measures
The key evaluation criterion is the degree to which the objective measures, separately and in combination, can predict the mean listener scores, as measured by the product-moment correlation. There are many ways in which, for a given task, the multiple acoustic measures can be combined, including neural nets, support vector machines, and other linear or non-linear methods. We have decided to use simple linear regression because of its robustness and its small number of parameters, which may
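The evaluation scheme described here can be sketched with synthetic data. The feature matrix, weights, and noise level below are invented for illustration; only the pipeline (linear regression combining multiple acoustic measures, evaluated by the product-moment correlation with mean listener scores) mirrors the text:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: n speakers x p acoustic measures, plus the mean
# listener scores those measures should predict.
n, p = 40, 3
X = rng.normal(size=(n, p))
true_w = np.array([0.8, -0.3, 0.5])            # illustrative weights
y = X @ true_w + rng.normal(scale=0.2, size=n)  # mean listener scores

# Simple linear regression (with intercept) as the combiner: a robust,
# parameter-poor choice relative to neural nets or SVMs.
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
pred = A @ coef

# Key evaluation criterion: product-moment correlation between the
# combined prediction and the mean listener scores.
r = np.corrcoef(pred, y)[0, 1]
```

Note that this computes an in-sample correlation; a real validation would use held-out data or cross-validation.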
Conclusions
The automated methods proposed in this paper succeeded in approximating human ratings in reliability, as assessed via correlations with auditory-perceptual mean listener ratings. In addition, the objective measures were superior to the conventional method of assessment in which the examiner makes real-time judgments and verifies these off-line. These automated methods could be of immediate practical value in terms of substantial labor savings and enhanced reliability.
More important, however,
Disclosure
This manuscript has not been published in whole or substantial part by another publisher and is not currently under review by another journal.
Acknowledgements
We thank Sue Peppé for making available the pictorial stimuli and for granting permission to us for creating new versions of several PEPS-C tasks; Lawrence Shriberg for helpful comments on an earlier draft of the paper, in particular on the LSR section; Rhea Paul for helpful comments on an earlier draft of the paper and for suggesting the Lexical Stress and Pragmatic Style Tasks; the clinical staff at CSLU (Beth Langhorst, Rachel Coulston, and Robbyn Sanger Hahn) and at Yale University (Nancy
References (65)
- et al., Judgments of vocal affect by language-delayed children. J. Comm. Disord. (1983)
- et al., A replication of the Autism Diagnostic Observation Schedule (ADOS) revised algorithms. J. Am. Acad. Child Adolesc. Psychiatr. (2008)
- Grabe et al., Pitch accent realization in four varieties of British English. J. Phonet. (2000)
- Ofuka et al., Prosodic cues for rated politeness in Japanese speech. Speech Comm. (2000)
- Peppé et al., Assessing prosodic and pragmatic ability in children with high-functioning autism. J. Pragmat. (2006)
- Scherer, Vocal communication of emotion: a review of research paradigms. Speech Comm. (2003)
- van Santen, Contextual effects on vowel duration. Speech Comm. (1992)
- Perceptual experiments for diagnostic testing of text-to-speech systems. Comput. Speech Lang. (1993)
- Zhang, Jiang, Acoustic analyses of sustained and running voices from patients with laryngeal pathologies. J. Voice (2008)
- Adami, A.G., Mihaescu, R., Reynolds, D.A., Godfrey, J.J., 2003. Modeling prosodic dynamics for speaker recognition. In:...
- Statistical Inference Under Order Restrictions
- Autism screening questionnaire: diagnostic validity. Brit. J. Psychiatr.
- Cardozo, Ritsma, On the perception of imperfect periodicity. IEEE Trans. Audio Electroacoust. (1968)
- A coefficient of agreement for nominal scales. Educ. Psychol. Measur.
- Darley et al., Differential diagnostic patterns of dysarthria. J. Speech Hear. Res. (1969)
- Darley et al., Clusters of deviant speech dimensions in the dysarthrias. J. Speech Hear. Res. (1969)
- Nonword repetition and child language disorder. J. Speech Lang. Hear. Res.
- Pictures of Facial Affect
- High functioning people with autism and Asperger syndrome
- The Autism Diagnostic Observation Schedule (ADOS): revised algorithms for improved diagnostic validity. J. Autism Dev. Disord.
- Hirschberg, A corpus-based approach to the study of speaking style
- House et al., Affective prosody in the reading voice of stroke patients. J. Neurol. Neurosurg. Psychiatr. (1987)
- Kent, Hearing and believing: some limits to the auditory-perceptual assessment of speech and voice disorders. Amer. J. Speech-Lang. Pathol. (1996)
- Klatt, Linguistic uses of segmental duration in English: acoustic and perceptual evidence. J. Acoust. Soc. Amer. (1976)
- Le Dorze et al., A comparison of the prosodic characteristics of the speech of people with Parkinson’s disease and Friedreich’s ataxia with neurologically normal speakers. Folia Phoniatr. Logop. (1998)
- Acoustics of children’s speech: developmental changes of temporal and spectral parameters. J. Acoust. Soc. Amer.
☆ Preliminary results of this work were presented as posters at IMFAR 2007 and IMFAR 2008.