Phrase-level speech simulation with an airway modulation model of speech production

https://doi.org/10.1016/j.csl.2012.10.005

Abstract

Artificial talkers and speech synthesis systems have long been used as a means of understanding both speech production and speech perception. This article describes the development of an airway modulation model that simulates the time-varying changes of the glottis and vocal tract, as well as acoustic wave propagation, during speech production. The result is a type of artificial talker that can be used to study various aspects of how sound is generated by humans and how that sound is perceived by a listener. The primary components of the model are introduced, and simulations of words and phrases are demonstrated.

Highlights

  • An airway modulation model of speech production was developed for simulating speech.
  • Model-based simulations are shown for two words and two phrases.
  • Audio samples and animations of vocal tract movement are included for each simulation.

Introduction

Speech is produced by transforming the motion of anatomical structures into an acoustic wave embedded with the distinctive characteristics of speech. This transformation can be conceived as a modulation of the human airway system on multiple time scales. For example, the rapid vibration of the vocal folds modulates the airspace between them (i.e., the glottis) on the order of 100–400 cycles per second to generate a train of flow pulses that excites the acoustic resonances of the trachea, vocal tract, and nasal passages. Simultaneous, but much slower, movements of the tongue, jaw, lips, velum, and larynx can be executed to modulate the shape of the pharyngeal and oral cavities, the coupling to the nasal system, and, through adduction and abduction maneuvers, the space between the vocal folds. These relatively slow modulations shift the acoustic resonances up or down in frequency and valve the flow of air through the system, thus altering the characteristics of the radiated acoustic wave over time and providing the stimulus from which listeners can extract phonetic information.
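To make the two time scales concrete, the following minimal sketch (in Python, with entirely illustrative numbers; it is not the model described in this article) excites a single slowly moving resonance with a glottal-rate pulse train: the pulse rate supplies the fast modulation, and the swept resonance frequency stands in for the slow articulatory "message."

```python
import numpy as np

fs = 16000                          # sample rate (Hz)
dur = 0.5                           # signal duration (s)
n = int(fs * dur)
f0 = 120.0                          # glottal pulse rate (Hz), within the 100-400 cps range

# Fast modulation: one impulse per glottal cycle (a crude flow-pulse train)
source = np.zeros(n)
source[(np.arange(0.0, dur, 1.0 / f0) * fs).astype(int)] = 1.0

# Slow modulation ("message wave"): a resonance swept from 700 Hz down to 300 Hz
fc = np.linspace(700.0, 300.0, n)
bw = 80.0                           # resonance bandwidth (Hz), assumed
r = np.exp(-np.pi * bw / fs)        # pole radius set by the bandwidth

# Two-pole resonator evaluated sample by sample so its frequency can vary
y = np.zeros(n)
for i in range(2, n):
    theta = 2.0 * np.pi * fc[i] / fs
    y[i] = source[i] + 2.0 * r * np.cos(theta) * y[i - 1] - r * r * y[i - 2]

y /= np.max(np.abs(y))              # normalize for listening or plotting
```

Plotting or listening to y shows the fast pulse structure carrying the slow resonance movement, the essence of the modulation view developed below.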

The view that human speech is produced by a modulation system was expressed by Dudley (1940) in an article called “The carrier nature of speech.” In it he referred to the relatively high-frequency excitation provided by phonation or noise generation as “carrier waves” that are modulated by slowly-varying, and otherwise inaudible, movements of the vocal tract called “message waves.” He based this view on his experience developing both the vocoder (Dudley, 1939) and the human-operated voder (Dudley et al., 1939), and, in the conclusion, made the curious point that a wide variety of carrier signals – even nonhuman sounds such as instrumental music – could be modulated by the “message waves” and still produce intelligible “speech.” This points to the importance of understanding articulatory movement in terms of how it modulates the shape of the pharyngeal and oral airspaces over time, which, in turn, modulates the acoustic characteristics of the speech signal. Traunmüller (1994) also proposed a modulation theory in which speech signals are considered to be the result of articulatory gestures, common across speakers, that modulate a “carrier” signal unique to the speaker. In this theory, however, the carrier signal is not simply the excitation signal, but includes any aspects of the system that are phonetically neutral and descriptive of the “personal quality” of the speaker. This suggests that embedded within the carrier would be contributions of the biological structure of the vocal tract as well as any idiosyncratic vocal tract shaping patterns, all of which would be modulated during speech production by linguistically meaningful gestures.

Studying speech as a modulation system can be aided by models that allow for enough control of relevant parameters to generate speech or speech-like sounds. Within such models, the shape of the trachea, vocal tract, and nasal passages is usually represented as a tubular system, quantified by a set of area functions (cf. Fant, 1960; Baer et al., 1991; Story et al., 1996). This permits computing the acoustic wave propagation through the system with one-dimensional methods in the time domain (cf. Kelly and Lochbaum, 1962; Maeda, 1982; Strube, 1982; Liljencrants, 1985; Smith, 1992) or frequency domain (Sondhi and Schroeter, 1987). Typically for speech, only the vocal tract portion, along with the nasal coupling region, is considered to be time-varying. Thus, the challenge in developing a model that can “speak” is to define a set of parameters that allow efficient, time-dependent control of the shape of the vocal tract area function and of the coupling to the nasal system.
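As a concrete illustration of the one-dimensional time-domain approach cited above, the sketch below propagates pressure waves through a uniform area function with Kelly–Lochbaum-style scattering junctions. The tube shape, boundary reflection values, and excitation are illustrative assumptions, not the settings used in the present model.

```python
import numpy as np

c = 35000.0                    # speed of sound (cm/s)
fs = 44100                     # sample rate (Hz); section length = c/fs ~ 0.8 cm
N = 22                         # 22 sections ~ 17.5 cm, a typical adult tract length
A = np.full(N, 3.0)            # uniform 3 cm^2 area function (neutral-like tube)

# Reflection coefficients at the N-1 internal junctions
k = (A[:-1] - A[1:]) / (A[:-1] + A[1:])

f = np.zeros(N)                # forward-traveling pressure wave in each section
b = np.zeros(N)                # backward-traveling pressure wave in each section

n_samp = 4096
out = np.zeros(n_samp)
for t in range(n_samp):
    f_in = (1.0 if t == 0 else 0.0) + 0.9 * b[0]   # glottal end: impulse + lossy reflection
    b_in = -0.9 * f[-1]                            # lip end: open-end (sign-inverting) reflection
    out[t] = f[-1] + b_in                          # crude radiated pressure at the lips

    # Scatter at each junction and propagate one section per sample
    f_new, b_new = np.empty(N), np.empty(N)
    f_new[0], b_new[-1] = f_in, b_in
    f_new[1:] = (1.0 + k) * f[:-1] - k * b[1:]
    b_new[:-1] = k * f[:-1] + (1.0 - k) * b[1:]
    f, b = f_new, b_new
```

For this uniform 17.5 cm tube the impulse response rings at roughly 500, 1500, and 2500 Hz, the familiar quarter-wavelength resonances of a neutral vocal tract; a non-uniform area function shifts these resonances, which is exactly the slow modulation the models above aim to control.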

An articulatory synthesizer is perhaps the most intuitively appealing approach to controlling the vocal tract because the model parameters consist of positions and movements of the tongue, jaw, lips, velum, etc. These are often represented in the two-dimensional midsagittal plane (cf. Lindblom and Sundberg, 1971; Mermelstein, 1973; Coker, 1976; Maeda, 1990; Scully, 1990) or as more complex three-dimensional models of articulatory structures (Dang and Honda, 2004; Birkholz et al., 2006; Birkholz et al., 2007); in either case, articulatory motion can be simulated by specifying the temporal variation of the model parameters. At any given instant of time, however, the articulatory configuration must be converted to an area function by empirically-based rules in order to calculate acoustic wave propagation and ultimately produce an acoustic speech signal suitable for analysis or listening (e.g., Rubin et al., 1981; Birkholz et al., 2010; Bauer et al., 2010).
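A widely used empirically-based rule of the kind mentioned above converts midsagittal width d(x) to cross-sectional area with a power law, A(x) = α(x)·d(x)^β(x). The sketch below is hypothetical: the coefficient values are placeholders, not the calibrated constants of any particular synthesizer.

```python
import numpy as np

def midsagittal_to_area(d, alpha, beta):
    """Convert midsagittal widths d (cm) to cross-sectional areas (cm^2)
    with the power rule A = alpha * d**beta, applied section by section."""
    return alpha * np.power(d, beta)

# Example: 40 sections from glottis to lips with region-dependent coefficients
d = np.linspace(0.5, 1.5, 40)                   # hypothetical midsagittal widths
alpha = np.where(np.arange(40) < 20, 1.6, 1.2)  # pharynx vs. oral cavity (assumed values)
beta = np.full(40, 1.4)                         # assumed exponent
A = midsagittal_to_area(d, alpha, beta)         # area function for the wave computation
```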

Other approaches consist of parameterizing the area function directly, rather than attending specifically to the anatomical structures. These are particularly useful when precise control of the vocal tract shape is desired. Early examples of this approach are the three-parameter models of Fant (1960) and Stevens and House (1955), in which the area function was described by a parabola controlled by a primary constriction location, the cross-sectional area at that location, and a ratio of lip opening length to its area. These models were later modified to include various enhancements (Atal et al., 1978; Lin, 1990; Fant, 1992; Fant, 2001). Another type of area function model was proposed by Mrayati et al. (1988); its parameters were not directly related to articulation but rather to portions of the area function determined to be acoustically sensitive to changes in cross-sectional area.
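A minimal sketch of a three-parameter area function in the spirit of these models is given below: a uniform tube with a parabolic constriction set by its location and minimum area, plus a lip term expressed as a length-to-area ratio. The section count, neutral area, and constriction width are illustrative assumptions, not the calibrated shapes of Fant (1960) or Stevens and House (1955).

```python
import numpy as np

def three_param_area(n_sec=40, x_c=25, A_c=0.3, lip_l_over_A=0.5,
                     A_neutral=3.0, half_width=8):
    """Area function (cm^2) for n_sec sections from glottis (0) to lips:
    constriction location x_c, area A_c at the constriction, and a
    lip length-to-area ratio for the final section."""
    A = np.full(n_sec, A_neutral)
    for i in range(n_sec):
        t = (i - x_c) / half_width          # normalized distance from the center
        if abs(t) < 1.0:
            # parabolic dip from the neutral area down to A_c at the center
            A[i] = A_c + (A_neutral - A_c) * t ** 2
    # lip section: with its length fixed at 1 cm here, the l/A ratio sets the area
    A[-1] = 1.0 / lip_l_over_A
    return A

A = three_param_area()                      # e.g., a mid-tract constriction
```

Sweeping x_c and A_c over time would move and deepen the constriction, which is how such models approximate vowel-to-vowel movement.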

Using any of these types of models to produce connected, coarticulated speech requires that the parameters allow for blending of the vowel and consonant contributions to the vocal tract shape. Öhman (1966, 1967) suggested that a consonant gesture (localized constriction) is superimposed on an underlying vowel substrate, rather than considering consonant and vowel to be separate, linearly sequenced gestures. Based on this view, Båvegård (1995) and Fant and Båvegård (1997) detailed an area function model in which the vowel contribution was represented by a three-parameter model (Fant, 1992), as mentioned previously, and a consonant constriction function could be superimposed on the vowel configuration to alter the shape of the area function at a particular location. Öhman's notion of vowel and consonant overlap has also influenced theoretical views of speech motor control. Gracco (1992), for instance, suggested that the vocal tract be considered the smallest unit of functional behavior for speech production, and that the movements of the vocal tract could be classified into “shaping” and “valving” actions. Relatively slow changes of overall vocal tract geometry constitute the shaping category and would generally be associated with vowel production, whereas valving actions would impose and release localized constrictions, primarily for consonants.
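The superposition idea can be sketched directly: treat the consonant as a localized constriction function that scales the vowel area function toward zero and is then released. The Gaussian shape and the parameter values below are illustrative assumptions, not the constriction kinematics of any published model.

```python
import numpy as np

def superimpose_constriction(A_vowel, loc, extent, magnitude):
    """Scale the vowel area function toward zero around section `loc`;
    magnitude=1 gives full occlusion at the constriction center."""
    x = np.arange(len(A_vowel))
    con = 1.0 - magnitude * np.exp(-0.5 * ((x - loc) / extent) ** 2)
    return A_vowel * con

# e.g., a bilabial-like closure: occlude near the lip end of a 40-section tract,
# closing gradually over five time steps while the vowel substrate is unchanged
A_vowel = np.full(40, 3.0)                  # stand-in vowel substrate
frames = [superimpose_constriction(A_vowel, loc=38, extent=2.0, magnitude=m)
          for m in np.linspace(0.0, 1.0, 5)]
```

In Gracco's terms, A_vowel carries the slow "shaping" action while the constriction function carries the "valving" action.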

Story (2005a) introduced an area function model conceptually similar to that of Båvegård (1995) and Fant and Båvegård (1997). That is, the model operates under the assumption that consonantal, or more accurately, obstruent-like constrictions can be superimposed on an underlying vowel-like area function to momentarily produce an occlusion or partial occlusion at some location in the vocal tract. It differs, however, in that the vowel substrate is generated by superimposing two shaping patterns, called modes (Story and Titze, 1998), on an otherwise neutral vocal tract configuration. Thus, at any instant in time, the shape of the area function and the subsequent acoustic output include contributions from multiple layers of modulation: (1) idiosyncratic characteristics of the neutral vocal tract, (2) the overall shaping influence of the modes, and (3) possible valving (constriction) functions that force some part of the area function to become zero or nearly so. The framework for the model is supported by analyses of data from both MRI and X-ray microbeam articulography (Westbury, 1994). Story (2005b, 2007) has shown that the two shaping modes seem to be fairly similar across talkers for vowel production, whereas the mean or neutral vocal tract shape is generally unique to a talker. In addition, Story (2009b) has shown that a time-varying vocal tract shape representative of a VCV utterance can be separated into a vowel-like substrate and contributions from the obstruent-like constriction.
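One plausible reading of this layered structure is sketched below: a neutral diameter-like function is perturbed by two weighted shaping modes and converted to areas, after which a valving function like the one sketched above can be applied for obstruents. The sinusoidal mode shapes and all numerical values are stand-ins for the empirically derived quantities reported in Story (2005a) and Story and Titze (1998).

```python
import numpy as np

N = 44                                   # sections from glottis to lips
x = np.linspace(0.0, 1.0, N)

omega = np.full(N, 1.5)                  # layer 1: neutral diameter function (cm), assumed
phi1 = np.sin(np.pi * x)                 # stand-in shaping mode 1
phi2 = np.sin(2.0 * np.pi * x)           # stand-in shaping mode 2

def vowel_substrate(q1, q2):
    """Layer 2: vowel-like area function from mode coefficients q1, q2."""
    d = omega + q1 * phi1 + q2 * phi2    # modes perturb the neutral shape
    return (np.pi / 4.0) * np.clip(d, 0.0, None) ** 2

# Layer 3: a valving function in [0, 1] imposes an obstruent-like occlusion
A = vowel_substrate(q1=0.4, q2=-0.2)
C = np.ones(N)
C[28:32] = 0.0                           # illustrative mid-tract closure
A_stop = A * C
```

Varying (q1, q2) slowly traces vowel-to-vowel movement, while C opens and closes on a faster schedule, mirroring the separation into substrate and constriction demonstrated in Story (2009b).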

The purpose of this article is to describe an airway modulation model of speech production based on the parametric system reported in Story (2005a), and to demonstrate that it can be used to simulate word-level and phrase-level utterances. The components of the model consist of kinematic representations of the medial surfaces of the vocal folds (Titze, 1984, 2006) and of the shape of the vocal tract as an area function (Story, 2005a), as well as static representations of the trachea and of the nasal passages and sinuses. The assembly of these components into a system that can generate simulated speech has recently been used to investigate the acoustics and perception of vowels, consonants, and various voice qualities (Bunton and Story, 2009, 2010, 2011; Story and Bunton, 2010; Samlan and Story, 2011). These studies, however, have focused on simple utterances such as isolated vowels and VCVs, which are fairly straightforward to simulate with systematic variation of model parameters. The aim of this article is to present a pilot study of several cases in which the model parameters are varied in more complex fashion to simulate a small collection of words and phrases. In the first part, the background and main components of the model are briefly described. The use of the model to generate words and phrases is demonstrated in the second part.


Airway modulation model

The airway modulation model is constructed such that a baseline configuration of the laryngeal, vocal tract, and nasal systems would, without any imposed control, produce a neutral, monotone vowel sound. The model parameters can then be activated to alter the baseline configuration in some desired manner so that the acoustic properties of the generated signal are changed. The parameters of the model are controlled by a set of hierarchical tiers as shown in Fig. 1. The leftmost

Simulation of word-level and phrase-level speech

In this section, two words and two phrases simulated with the TubeTalker system are demonstrated graphically and with multimedia content. The words are “Ohio” and “Abracadabra”, and the phrases are “He had a rabbit” and “The brown cow.” These were chosen to show how the various components of the TubeTalker model can be used to produce a range of speech sounds in a connected speech context. The word Ohio provides a case where the change in vocal tract shape is based only on vowels, and the

Conclusion

An airway modulation model called TubeTalker was introduced as a system for generating artificial speech. The overall goal in developing the model is to facilitate an understanding of how modulations of the basic structure of the glottis and vocal tract are acoustically encoded in the time variation of the speech signal, and perceptually decoded by a listener into phonetic elements. The model encodes the speech signal by assuming that: (1) an acoustically neutral state of the vocal tract

Acknowledgments

This research was supported by NIH R01 DC04789 and NIH R01 DC011275. A preliminary version of this work was presented at the 2011 International Workshop on Performative Speech and Singing Synthesis in Vancouver, BC.

References

  • P. Birkholz et al., Construction and control of a three-dimensional vocal tract model (2006)
  • P. Birkholz et al., Simulation of losses due to turbulence in the time-varying vocal system, IEEE Transactions on Audio, Speech, and Language Processing (2007)
  • P. Birkholz et al., Articulatory synthesis and perception of plosive-vowel syllables with virtual consonant targets (2010)
  • K. Bunton et al., Identification of synthetic vowels based on selected vocal tract area functions, Journal of the Acoustical Society of America (2009)
  • K. Bunton et al., Identification of synthetic vowels based on a time-varying model of the vocal tract area function, Journal of the Acoustical Society of America (2010)
  • K. Bunton et al., The relation of nasality and nasalance to nasal port area based on a computational model, The Cleft Palate-Craniofacial Journal (2011)
  • C.H. Coker, A model of articulatory dynamics and control, Proceedings of the IEEE (1976)
  • J. Dang et al., Construction and control of a physiological articulatory model, Journal of the Acoustical Society of America (2004)
  • H. Dudley, Remaking speech, Journal of the Acoustical Society of America (1939)
  • H. Dudley, The carrier nature of speech, Bell System Technical Journal (1940)
  • G. Fant, The Acoustic Theory of Speech Production (1960)
  • G. Fant, Vocal tract area functions of Swedish vowels and a new three-parameter model (1992)
  • G. Fant, Swedish vowels and a new three-parameter model, TMH-QPSR (2001)
  • G. Fant et al., Parametric model of VT area functions: vowels and consonants, TMH-QPSR (1997)
  • J.L. Flanagan et al., Excitation of vocal-tract synthesizers, Journal of the Acoustical Society of America (1969)
  • V.L. Gracco, Characteristics of speech as a motor control system, The Haskins Laboratories Status Report on Speech Research (1992)
  • J.L. Kelly et al., Speech synthesis (1962)
  • T.I. Laakso et al., Splitting the unit delay, IEEE Signal Processing Magazine (1996)
  • J. Liljencrants, Speech synthesis with a reflection-type line analog, D.Sc. dissertation, Dept. of Speech Comm. and... (1985)

    This paper has been recommended for acceptance by Bernd Moebius.
