Phrase-level speech simulation with an airway modulation model of speech production

https://doi.org/10.1016/j.csl.2012.10.005

Abstract

Artificial talkers and speech synthesis systems have long been used as a means of understanding both speech production and speech perception. This article describes the development of an airway modulation model that simulates the time-varying changes of the glottis and vocal tract, as well as acoustic wave propagation, during speech production. The result is a type of artificial talker that can be used to study various aspects of how sound is generated by humans and how that sound is perceived by a listener. The primary components of the model are introduced, and simulations of words and phrases are demonstrated.

Highlights

  • An airway modulation model of speech production was developed for simulating speech.
  • Model-based simulations are shown for two words and two phrases.
  • Audio samples and animations of vocal tract movement are included for each simulation.

Introduction

Speech is produced by transforming the motion of anatomical structures into an acoustic wave embedded with the distinctive characteristics of speech. This transformation can be conceived as a modulation of the human airway system on multiple time scales. For example, the rapid vibration of the vocal folds modulates the airspace between them (i.e., the glottis) on the order of 100–400 cycles per second to generate a train of flow pulses that excites the acoustic resonances of the trachea, vocal tract, and nasal passages. Simultaneous, but much slower, movements of the tongue, jaw, lips, velum, and larynx can be executed to modulate the shape of the pharyngeal and oral cavities, the coupling to the nasal system, and, through adduction and abduction maneuvers, the space between the vocal folds. These relatively slow modulations shift the acoustic resonances up or down in frequency and valve the flow of air through the system, thus altering the characteristics of the radiated acoustic wave over time and providing the stimulus from which listeners can extract phonetic information.
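To make the two time scales concrete, the following minimal sketch (in Python, with entirely illustrative numbers; it is not the model described in this article) excites a single slowly moving resonance with a glottal-rate pulse train: the pulse rate supplies the fast modulation, and the swept resonance frequency stands in for the slow articulatory "message."

```python
import numpy as np

fs = 16000                          # sample rate (Hz)
dur = 0.5                           # signal duration (s)
n = int(fs * dur)
f0 = 120.0                          # glottal pulse rate (Hz), within the 100-400 cps range

# Fast modulation: one impulse per glottal cycle (a crude flow-pulse train)
source = np.zeros(n)
source[(np.arange(0.0, dur, 1.0 / f0) * fs).astype(int)] = 1.0

# Slow modulation ("message wave"): a resonance swept from 700 Hz down to 300 Hz
fc = np.linspace(700.0, 300.0, n)
bw = 80.0                           # resonance bandwidth (Hz), assumed
r = np.exp(-np.pi * bw / fs)        # pole radius set by the bandwidth

# Two-pole resonator evaluated sample by sample so its frequency can vary
y = np.zeros(n)
for i in range(2, n):
    theta = 2.0 * np.pi * fc[i] / fs
    y[i] = source[i] + 2.0 * r * np.cos(theta) * y[i - 1] - r * r * y[i - 2]

y /= np.max(np.abs(y))              # normalize for listening or plotting
```

Plotting or listening to y shows the fast pulse structure carrying the slow resonance movement, the essence of the modulation view developed below.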

The view that human speech is produced by a modulation system was expressed by Dudley (1940) in an article called “The carrier nature of speech.” In it he referred to the relatively high-frequency excitation provided by phonation or noise generation as “carrier waves” that are modulated by slowly-varying, and otherwise inaudible, movements of the vocal tract called “message waves.” He based this view on his experience developing both the vocoder (Dudley, 1939) and the human-operated voder (Dudley et al., 1939), and, in the conclusion, made the curious point that a wide variety of carrier signals – even nonhuman sounds such as instrumental music – could be modulated by the “message waves” and still produce intelligible “speech.” This points to the importance of understanding articulatory movement in terms of how it modulates the shape of the pharyngeal and oral airspaces over time, which, in turn, modulates the acoustic characteristics of the speech signal. Traunmüller (1994) also proposed a modulation theory in which speech signals are considered to be the result of articulatory gestures, common across speakers, that modulate a “carrier” signal unique to the speaker. In this theory, however, the carrier signal is not simply the excitation signal, but includes any aspects of the system that are phonetically neutral and descriptive of the “personal quality” of the speaker. This suggests that embedded within the carrier would be contributions of the biological structure of the vocal tract as well as any idiosyncratic vocal tract shaping patterns, all of which would be modulated during speech production by linguistically meaningful gestures.

Studying speech as a modulation system can be aided by models that allow for enough control of relevant parameters to generate speech or speech-like sounds. Within such models, the shape of the trachea, vocal tract, and nasal passages is usually represented as a tubular system, quantified by a set of area functions (cf. Fant, 1960; Baer et al., 1991; Story et al., 1996). This permits computing the acoustic wave propagation through the system with one-dimensional methods in the time domain (cf. Kelly and Lochbaum, 1962; Maeda, 1982; Strube, 1982; Liljencrants, 1985; Smith, 1992) or frequency domain (Sondhi and Schroeter, 1987). Typically for speech, only the vocal tract portion, along with the nasal coupling region, is considered to be time-varying. Thus, the challenge in developing a model that can “speak” is to define a set of parameters that allow efficient, time-dependent control of the shape of the vocal tract area function and of the coupling to the nasal system.
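As a concrete illustration of the one-dimensional time-domain approach cited above, the sketch below propagates pressure waves through a uniform area function with Kelly–Lochbaum-style scattering junctions. The tube shape, boundary reflection values, and excitation are illustrative assumptions, not the settings used in the present model.

```python
import numpy as np

c = 35000.0                    # speed of sound (cm/s)
fs = 44100                     # sample rate (Hz); section length = c/fs ~ 0.8 cm
N = 22                         # 22 sections ~ 17.5 cm, a typical adult tract length
A = np.full(N, 3.0)            # uniform 3 cm^2 area function (neutral-like tube)

# Reflection coefficients at the N-1 internal junctions
k = (A[:-1] - A[1:]) / (A[:-1] + A[1:])

f = np.zeros(N)                # forward-traveling pressure wave in each section
b = np.zeros(N)                # backward-traveling pressure wave in each section

n_samp = 4096
out = np.zeros(n_samp)
for t in range(n_samp):
    f_in = (1.0 if t == 0 else 0.0) + 0.9 * b[0]   # glottal end: impulse + lossy reflection
    b_in = -0.9 * f[-1]                            # lip end: open-end (sign-inverting) reflection
    out[t] = f[-1] + b_in                          # crude radiated pressure at the lips

    # Scatter at each junction and propagate one section per sample
    f_new, b_new = np.empty(N), np.empty(N)
    f_new[0], b_new[-1] = f_in, b_in
    f_new[1:] = (1.0 + k) * f[:-1] - k * b[1:]
    b_new[:-1] = k * f[:-1] + (1.0 - k) * b[1:]
    f, b = f_new, b_new
```

For this uniform 17.5 cm tube the impulse response rings at roughly 500, 1500, and 2500 Hz, the familiar quarter-wavelength resonances of a neutral vocal tract; a non-uniform area function shifts these resonances, which is exactly the slow modulation the models above aim to control.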

An articulatory synthesizer is perhaps the most intuitively appealing approach to controlling the vocal tract because the model parameters consist of positions and movements of the tongue, jaw, lips, velum, etc. These are often represented in the two-dimensional midsagittal plane (cf. Lindblom and Sundberg, 1971; Mermelstein, 1973; Coker, 1976; Maeda, 1990; Scully, 1990) or as more complex three-dimensional models of articulatory structures (Dang and Honda, 2004; Birkholz et al., 2006; Birkholz et al., 2007); in either case, articulatory motion can be simulated by specifying the temporal variation of the model parameters. At any given instant of time, however, the articulatory configuration must be converted to an area function by empirically-based rules in order to calculate acoustic wave propagation and ultimately produce an acoustic speech signal suitable for analysis or listening (e.g., Rubin et al., 1981; Birkholz et al., 2010; Bauer et al., 2010).
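A widely used empirically-based rule of the kind mentioned above converts midsagittal width d(x) to cross-sectional area with a power law, A(x) = α(x)·d(x)^β(x). The sketch below is hypothetical: the coefficient values are placeholders, not the calibrated constants of any particular synthesizer.

```python
import numpy as np

def midsagittal_to_area(d, alpha, beta):
    """Convert midsagittal widths d (cm) to cross-sectional areas (cm^2)
    with the power rule A = alpha * d**beta, applied section by section."""
    return alpha * np.power(d, beta)

# Example: 40 sections from glottis to lips with region-dependent coefficients
d = np.linspace(0.5, 1.5, 40)                   # hypothetical midsagittal widths
alpha = np.where(np.arange(40) < 20, 1.6, 1.2)  # pharynx vs. oral cavity (assumed values)
beta = np.full(40, 1.4)                         # assumed exponent
A = midsagittal_to_area(d, alpha, beta)         # area function for the wave computation
```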

Other approaches consist of parameterizing the area function directly, rather than attending specifically to the anatomical structures. These are particularly useful when precise control of the vocal tract shape is desired. Early examples of this approach are the three-parameter models of Fant (1960) and Stevens and House (1955), in which the area function was described by a parabola controlled by a primary constriction location, the cross-sectional area at that location, and a ratio of lip opening length to its area. These models were later modified to include various enhancements (Atal et al., 1978; Lin, 1990; Fant, 1992; Fant, 2001). Another type of area function model was proposed by Mrayati et al. (1988); its parameters were not directly related to articulation but rather to portions of the area function determined to be acoustically sensitive to changes in cross-sectional area.
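A minimal sketch of a three-parameter area function in the spirit of these models is given below: a uniform tube with a parabolic constriction set by its location and minimum area, plus a lip term expressed as a length-to-area ratio. The section count, neutral area, and constriction width are illustrative assumptions, not the calibrated shapes of Fant (1960) or Stevens and House (1955).

```python
import numpy as np

def three_param_area(n_sec=40, x_c=25, A_c=0.3, lip_l_over_A=0.5,
                     A_neutral=3.0, half_width=8):
    """Area function (cm^2) for n_sec sections from glottis (0) to lips:
    constriction location x_c, area A_c at the constriction, and a
    lip length-to-area ratio for the final section."""
    A = np.full(n_sec, A_neutral)
    for i in range(n_sec):
        t = (i - x_c) / half_width          # normalized distance from the center
        if abs(t) < 1.0:
            # parabolic dip from the neutral area down to A_c at the center
            A[i] = A_c + (A_neutral - A_c) * t ** 2
    # lip section: with its length fixed at 1 cm here, the l/A ratio sets the area
    A[-1] = 1.0 / lip_l_over_A
    return A

A = three_param_area()                      # e.g., a mid-tract constriction
```

Sweeping x_c and A_c over time would move and deepen the constriction, which is how such models approximate vowel-to-vowel movement.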

Using any of these types of models to produce connected, coarticulated speech requires that the parameters allow for blending of the vowel and consonant contributions to the vocal tract shape. Öhman (1966, 1967) suggested that a consonant gesture (localized constriction) is superimposed on an underlying vowel substrate, rather than considering consonant and vowel to be separate, linearly sequenced gestures. Based on this view, Båvegård (1995) and Fant and Båvegård (1997) detailed an area function model in which the vowel contribution was represented by a three-parameter model (Fant, 1992), as mentioned previously, and a consonant constriction function could be superimposed on the vowel configuration to alter the shape of the area function at a particular location. Öhman's notion of vowel and consonant overlap has also influenced theoretical views of speech motor control. Gracco (1992), for instance, suggested that the vocal tract be considered the smallest unit of functional behavior for speech production, and that the movements of the vocal tract could be classified into “shaping” and “valving” actions. Relatively slow changes of overall vocal tract geometry constitute the shaping category and would generally be associated with vowel production, whereas valving actions would impose and release localized constrictions, primarily for consonants.
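The superposition idea can be sketched directly: treat the consonant as a localized constriction function that scales the vowel area function toward zero and is then released. The Gaussian shape and the parameter values below are illustrative assumptions, not the constriction kinematics of any published model.

```python
import numpy as np

def superimpose_constriction(A_vowel, loc, extent, magnitude):
    """Scale the vowel area function toward zero around section `loc`;
    magnitude=1 gives full occlusion at the constriction center."""
    x = np.arange(len(A_vowel))
    con = 1.0 - magnitude * np.exp(-0.5 * ((x - loc) / extent) ** 2)
    return A_vowel * con

# e.g., a bilabial-like closure: occlude near the lip end of a 40-section tract,
# closing gradually over five time steps while the vowel substrate is unchanged
A_vowel = np.full(40, 3.0)                  # stand-in vowel substrate
frames = [superimpose_constriction(A_vowel, loc=38, extent=2.0, magnitude=m)
          for m in np.linspace(0.0, 1.0, 5)]
```

In Gracco's terms, A_vowel carries the slow "shaping" action while the constriction function carries the "valving" action.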

Story (2005a) introduced an area function model conceptually similar to that of Båvegård (1995) and Fant and Båvegård (1997). That is, the model operates under the assumption that consonantal, or more accurately, obstruent-like constrictions can be superimposed on an underlying vowel-like area function to momentarily produce an occlusion or partial occlusion at some location in the vocal tract. It differs, however, in that the vowel substrate is generated by superimposing two shaping patterns, called modes (Story and Titze, 1998), on an otherwise neutral vocal tract configuration. Thus, at any instant in time, the shape of the area function and the subsequent acoustic output include contributions from multiple layers of modulation: (1) idiosyncratic characteristics of the neutral vocal tract, (2) the overall shaping influence of the modes, and (3) possible valving (constriction) functions that force some part of the area function to become zero or nearly so. The framework for the model is supported by analyses of data from both MRI and X-ray microbeam articulography (Westbury, 1994). Story (2005b, 2007) has shown that the two shaping modes seem to be fairly similar across talkers for vowel production, whereas the mean or neutral vocal tract shape is generally unique to a talker. In addition, Story (2009b) has shown that a time-varying vocal tract shape representative of a VCV utterance can be separated into a vowel-like substrate and contributions from the obstruent-like constriction.
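One plausible reading of this layered structure is sketched below: a neutral diameter-like function is perturbed by two weighted shaping modes and converted to areas, after which a valving function like the one sketched above can be applied for obstruents. The sinusoidal mode shapes and all numerical values are stand-ins for the empirically derived quantities reported in Story (2005a) and Story and Titze (1998).

```python
import numpy as np

N = 44                                   # sections from glottis to lips
x = np.linspace(0.0, 1.0, N)

omega = np.full(N, 1.5)                  # layer 1: neutral diameter function (cm), assumed
phi1 = np.sin(np.pi * x)                 # stand-in shaping mode 1
phi2 = np.sin(2.0 * np.pi * x)           # stand-in shaping mode 2

def vowel_substrate(q1, q2):
    """Layer 2: vowel-like area function from mode coefficients q1, q2."""
    d = omega + q1 * phi1 + q2 * phi2    # modes perturb the neutral shape
    return (np.pi / 4.0) * np.clip(d, 0.0, None) ** 2

# Layer 3: a valving function in [0, 1] imposes an obstruent-like occlusion
A = vowel_substrate(q1=0.4, q2=-0.2)
C = np.ones(N)
C[28:32] = 0.0                           # illustrative mid-tract closure
A_stop = A * C
```

Varying (q1, q2) slowly traces vowel-to-vowel movement, while C opens and closes on a faster schedule, mirroring the separation into substrate and constriction demonstrated in Story (2009b).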

The purpose of this article is to describe an airway modulation model of speech production based on the parametric system reported in Story (2005a), and to demonstrate that it can be used to simulate word-level and phrase-level utterances. The components of the model consist of kinematic representations of the medial surfaces of the vocal folds (Titze, 1984, 2006) and of the shape of the vocal tract as an area function (Story, 2005a), as well as static representations of the trachea and of the nasal passages and sinuses. The assembly of these components into a system that can generate simulated speech has recently been used to investigate the acoustics and perception of vowels, consonants, and various voice qualities (Bunton and Story, 2009, 2010, 2011; Story and Bunton, 2010; Samlan and Story, 2011). These studies, however, have focused on simple utterances such as isolated vowels and VCVs, which are fairly straightforward to simulate with systematic variation of model parameters. The aim of this article is to present a pilot study of several cases in which the model parameters are varied in more complex fashion to simulate a small collection of words and phrases. In the first part, the background and main components of the model are briefly described. The use of the model to generate words and phrases is demonstrated in the second part.


Airway modulation model

The airway modulation model is constructed such that a baseline configuration of the laryngeal, vocal tract, and nasal systems would, without any imposed control, produce a neutral, monotone vowel sound. The model parameters can then be activated to alter the baseline configuration in some desired manner so that the acoustic properties of the generated signal are changed. The parameters of the model are controlled by a set of hierarchical tiers as shown in Fig. 1. The leftmost

Simulation of word-level and phrase-level speech

In this section, two words and two phrases simulated with the TubeTalker system are demonstrated graphically and with multimedia content. The words are “Ohio” and “Abracadabra”, and the phrases are “He had a rabbit” and “The brown cow.” These were chosen to show how the various components of the TubeTalker model can be used to produce a range of speech sounds in a connected speech context. The word Ohio provides a case where the change in vocal tract shape is based only on vowels, and the

Conclusion

An airway modulation model called TubeTalker was introduced as a system for generating artificial speech. The overall goal in developing the model is to facilitate an understanding of how modulations of the basic structure of the glottis and vocal tract are acoustically encoded in the time variation of the speech signal, and perceptually decoded by a listener into phonetic elements. The model encodes the speech signal by assuming that: (1) an acoustically neutral state of the vocal tract

Acknowledgments

This research was supported by NIH R01 DC04789 and NIH R01 DC011275. A preliminary version of this work was presented at the 2011 International Workshop on Performative Speech and Singing Synthesis in Vancouver, BC.

References

  • P. Birkholz et al., Construction and control of a three-dimensional vocal tract model (2006)
  • P. Birkholz et al., Simulation of losses due to turbulence in the time-varying vocal system, IEEE Transactions on Audio, Speech, and Language Processing (2007)
  • P. Birkholz et al., Articulatory synthesis and perception of plosive-vowel syllables with virtual consonant targets (2010)
  • K. Bunton et al., Identification of synthetic vowels based on selected vocal tract area functions, Journal of the Acoustical Society of America (2009)
  • K. Bunton et al., Identification of synthetic vowels based on a time-varying model of the vocal tract area function, Journal of the Acoustical Society of America (2010)
  • K. Bunton et al., The relation of nasality and nasalance to nasal port area based on a computational model, The Cleft Palate-Craniofacial Journal (2011)
  • C.H. Coker, A model of articulatory dynamics and control, Proceedings of the IEEE (1976)
  • J. Dang et al., Construction and control of a physiological articulatory model, Journal of the Acoustical Society of America (2004)
  • H. Dudley, Remaking speech, Journal of the Acoustical Society of America (1939)
  • H. Dudley, The carrier nature of speech, Bell System Technical Journal (1940)
  • G. Fant, The Acoustic Theory of Speech Production (1960)
  • G. Fant, Vocal tract area functions of Swedish vowels and a new three-parameter model (1992)
  • G. Fant, Swedish vowels and a new three-parameter model, TMH-QPSR (2001)
  • G. Fant et al., Parametric model of VT area functions: vowels and consonants, TMH-QPSR (1997)
  • J.L. Flanagan et al., Excitation of vocal-tract synthesizers, Journal of the Acoustical Society of America (1969)
  • V.L. Gracco, Characteristics of speech as a motor control system, The Haskins Laboratories Status Report on Speech Research (1992)
  • J.L. Kelly et al., Speech synthesis (1962)
  • T.I. Laakso et al., Splitting the unit delay, IEEE Signal Processing Magazine (1996)
  • J. Liljencrants, Speech synthesis with a reflection-type line analog, D.Sc. dissertation, Dept. of Speech Comm. and... (1985)

    This paper has been recommended for acceptance by Bernd Moebius.
