Stochastic models of glottal pulses from the Rosenberg and Liljencrants-Fant models with unified parameters

doi:10.1016/j.csl.2021.101225

Computer Speech & Language

Volume 69, September 2021, 101225

https://doi.org/10.1016/j.csl.2021.101225

Highlights

• Stochastic models proposed to generate jitter.
• Stochastic models based on the Rosenberg model and the LF model unified.
• The variation of the glottal time interval modelled as a stochastic process.
• Generation of intelligible voiced sounds.
• Generation of voice signals with jitter and, consequently, hoarse voices.

Abstract

In voice production, the random variation of the glottal cycles around a mean value, caused by the (quasi-)periodic movement of the vocal folds, generates the jitter phenomenon. Its study is important because of applications such as the identification of voice-related pathologies, the improvement of the naturalness of synthesized voices, and the calibration of signal processing algorithms that identify glottal cycles. The objective of this work is to build stochastic models for glottal signals, using the Rosenberg and Liljencrants-Fant models with unified parameters, and considering the glottal time interval (the total time from the opening to the closing of the vocal folds in each cycle) as a random variable, simulating the situation where the stiffness of the vocal folds is a stochastic process. Four different models are proposed, each generating jitter in the simulated glottal signals. Intelligible voiced sounds are obtained both for normal voices, whose signals have a low level of jitter, and for hoarse voices, whose levels of jitter are higher. All the simulated sounds are available for listening.

Introduction

Human beings are capable of expressing their feelings, thoughts and wishes through the voice, which is an important means of communication for living in society. The generation of voice, in the case of voiced sounds, starts with the airflow coming from the lungs and forcing the vocal folds to oscillate. As the air passes through the glottis, where the vocal folds are located, air pulses are generated, forming the so-called glottal signal, a (quasi-)periodic signal representing the glottal flow. This signal is then filtered and amplified by the vocal tract (the portion from the glottis up to the mouth) and finally radiated by the mouth, generating the sound.

The small random variations in the glottal cycles, due to the asymmetric movement of the vocal folds, constitute a random phenomenon called jitter. There are different measures of jitter; one of them, called local jitter, gives the relative percentage of the disturbances in relation to a mean value, and its typical value for normal voices is around 1% (Mongia and Sharma, 2014; Wong et al., 1991). It has been verified empirically that vocal jitter increases for some dysphonic voices (Pinto and Titze, 1990; Schoentgen and De Guchteneere, 1995).

Models of jitter can be used to improve the naturalness of synthesized voices (even if voices with pathological characteristics are considered), to simulate hoarse voices (Bangayan et al., 1997), to generate voice signals for calibrating signal processing algorithms, and to help in detecting glottal cycles (Muta et al., 1988). In addition, models of jitter could suggest or confirm the mathematical form of markers that would characterize perturbed cycle lengths statistically rather than heuristically.

Schoentgen (2001) presented stochastic models of jitter in which the simulated instantaneous frequency perturbs the glottal cycles. Ruinskiy and Lavner (2008) presented an algorithm to transform a normal voice into a hoarse voice. Using spring-mass-damper mechanical models of the vocal folds, Cataldo and Soize (2016, 2018) considered a stochastic model for the stiffness of the spring to generate jitter.

In this paper, a voice modification algorithm for transforming a modal voice into a hoarse voice is presented. The algorithm is based on creating jitter, which is known to be a random phenomenon. Two deterministic mathematical models of the glottal signal are considered, the Rosenberg model and the Liljencrants-Fant model (LF model) with unified parameters (Henrich et al., 2002). Taking the time variation of the glottal intervals to be a stochastic process, stochastic models of glottal signals that generate jitter are proposed. The LF model is the more sophisticated of the two and approximates the real glottal signal better. Then, through the Fant source-filter theory, voice signals with jitter are synthesized. Two different power spectral densities (PSDs) are associated with the proposed stochastic models and, consequently, four different models are created for the glottal signals. Simulations are performed and voiced sounds with different levels of jitter are obtained. A comparison between the models is performed.

It is important to say that, in general, the quality of the signal generated is related to the mathematical model of the pulse used (Rosenberg or LF), to the bandwidths of the formant frequencies and also to the level of jitter.

In real voices, jitter is always present. The naturalness of synthesized voices is therefore improved when the phenomenon is included, not only for normal voices but also for hoarse voices, which have a greater level of jitter and can also be associated with pathological cases.

With the model presented, it is possible to generate the random phenomenon that is present in all voice signals. By numerical simulation, a large dataset can be generated for different voice signals, with different levels of jitter, and for different kinds of pathologies. Such a dataset can be used for training an artificial neural network. Artificial neural networks have been used by some researchers to classify voices with pathological characteristics (Mohammed et al., 2020; Megala et al., 2019; Souissi and Cherif, 2016; Lotfi et al., 2010).

The proposed stochastic model makes it possible to obtain good speech synthesis with jitter. Although jitter is not a particularly robust clinical measure, it is one of the main parameters used, and this is the main reason for focusing on it. With this stochastic model, it will therefore be possible, in a second phase, to classify pathologies, also using other parameters, as presented in different works in the literature. Some of them are very recent (Bennane and Kacha, 2017; Schoentgen and Aichinger, 2019; Lucero et al., 2020; Asiaee et al., 2020; Wei et al., 2020; Chiaramonte and Bonfiglio, 2020; Li et al., 2021).

The proposed model shows a way to create jitter in a voice signal on a consistent mathematical basis, rather than by simply adding noise to the signal. Moreover, the mathematical formulation used to generate jitter is robust. Different power spectral density functions can be tested and the corresponding inverse problem solved to identify parameters related to a specific voice or group of voices, including voices with pathological characteristics.

The ideas discussed here are based on earlier studies of stochastic models of jitter, covering not only signal processing but also mechanical models that generate jitter. These models are based on the fact that the stiffness of the vocal folds is a stochastic process and that, as a consequence, jitter is produced. In terms of signal processing, the idea is to reproduce the effect of modelling the stiffness as a stochastic process by considering instead the glottal time interval as a stochastic process.

The synthesized sounds are available to readers (the corresponding link is given later in the paper), so the intelligibility of the sounds created can be verified, both for normal voices and for hoarse voices, which could characterize any type of pathology.

Section snippets

The source-filter theory

The complete model presented here is based on the source-filter Fant theory (Fant, 1960), illustrated in Fig. 1.

The voice signal generated, $s(t)$, is given by the convolution of the glottal signal $u_{g}(t)$, the impulse response function $h_{v}(t)$ of the filter that models the vocal tract, and the impulse response function $h_{r}(t)$ of the radiation by the mouth. Eq. (1) is then obtained:

$$s(t) = (h_{r} * (h_{v} * u_{g}))(t), \tag{1}$$

or, in the frequency domain, Eq. (2) is obtained:

$$\hat{s}(\omega) = \hat{h}_{r}(\omega)\,\hat{h}_{v}(\omega)\,\hat{u}_{g}(\omega), \tag{2}$$
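The relation between Eq. (1) and Eq. (2) can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: the function names, the toy signals, and the use of a first difference as a stand-in for the radiation response $h_{r}$ are all assumptions made for illustration.

```python
import numpy as np

def synthesize_voice(u_g, h_v, h_r):
    """Time-domain form of Eq. (1): s = h_r * (h_v * u_g)."""
    return np.convolve(h_r, np.convolve(h_v, u_g))

def synthesize_voice_freq(u_g, h_v, h_r):
    """Frequency-domain form of Eq. (2): product of the three spectra."""
    # Zero-pad to the length of the full linear convolution so that the
    # circular convolution implied by the FFT matches the linear one.
    n = len(u_g) + len(h_v) + len(h_r) - 2
    spectrum = np.fft.rfft(h_r, n) * np.fft.rfft(h_v, n) * np.fft.rfft(u_g, n)
    return np.fft.irfft(spectrum, n)

# Toy inputs (hypothetical, for illustration only).
rng = np.random.default_rng(0)
u_g = rng.random(64)                     # placeholder "glottal signal"
k = np.arange(32)
h_v = np.exp(-0.1 * k) * np.cos(0.5 * k)  # one decaying resonance as a toy vocal tract
h_r = np.array([1.0, -1.0])              # first difference, assumed radiation model

s_time = synthesize_voice(u_g, h_v, h_r)
s_freq = synthesize_voice_freq(u_g, h_v, h_r)
```

Both routes return the same samples up to floating-point error, which is the discrete counterpart of the convolution theorem that links the two equations.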

The jitter phenomenon

Jitter is a random phenomenon present in all human voices, caused by small perturbations of the instantaneous fundamental frequency. It is an acoustic characteristic with important applications in voice-related studies, such as the identification of pathologies, the identification of voice aging, voice recognition, speaker recognition, and others (Wilcox and Horii, 1980; Li et al., 2010; Mendonza et al., 2014). Different measures of jitter can be used.
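As one concrete example of such a measure, local jitter is commonly computed as the mean absolute difference between consecutive cycle lengths divided by the mean cycle length, expressed as a percentage. The sketch below assumes that definition; the function name and the sample values are hypothetical and the exact formula used in the paper is not shown in this excerpt.

```python
import numpy as np

def local_jitter_percent(periods):
    """Local jitter (%) from a sequence of glottal cycle lengths in seconds.

    Assumed definition: mean absolute difference between consecutive
    periods, divided by the mean period, times 100.
    """
    periods = np.asarray(periods, dtype=float)
    if periods.size < 2:
        raise ValueError("at least two cycle lengths are required")
    mean_abs_diff = np.mean(np.abs(np.diff(periods)))
    return 100.0 * mean_abs_diff / np.mean(periods)

# Hypothetical cycle lengths around 10 ms (a 100 Hz voice).
example_periods = [0.0100, 0.0101, 0.0099, 0.0100]
example_jitter = local_jitter_percent(example_periods)
```

For these four cycles the result is about 1.33%, slightly above the value of around 1% quoted for normal voices.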

Glottal signal stochastic models

For each deterministic model discussed, the Rosenberg model and the LF model, the time variation of the glottal time interval is treated as a stochastic process. For each of the two models, two different power spectral densities are associated with the stochastic process, so four complete glottal pulse models are considered. They are described as follows.
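The general idea of a cycle length that varies randomly from one glottal pulse to the next can be sketched as follows. This is an illustration under strong assumptions and not the paper's models: it uses a trigonometric Rosenberg-type pulse with hypothetical shape parameters (`open_quotient`, `rise_fraction`) rather than the unified parameter set, and it draws independent Gaussian perturbations of the cycle length instead of a stochastic process with a prescribed power spectral density.

```python
import numpy as np

def rosenberg_pulse(n_samples, open_quotient=0.6, rise_fraction=2.0 / 3.0):
    """One cycle of a trigonometric Rosenberg-type pulse on n_samples points.

    open_quotient: fraction of the cycle during which the glottis is open.
    rise_fraction: fraction of the open phase spent opening.
    Both names and default values are illustrative assumptions.
    """
    n_open = int(round(open_quotient * n_samples))
    n_rise = int(round(rise_fraction * n_open))
    n_fall = n_open - n_rise
    pulse = np.zeros(n_samples)  # closed phase stays at zero flow
    pulse[:n_rise] = 0.5 * (1.0 - np.cos(np.pi * np.arange(n_rise) / n_rise))
    pulse[n_rise:n_open] = np.cos(0.5 * np.pi * np.arange(n_fall) / n_fall)
    return pulse

def jittered_glottal_signal(n_cycles, mean_period=0.01, rel_std=0.01,
                            fs=16000, seed=0):
    """Concatenate pulses whose cycle lengths fluctuate around mean_period.

    Assumption: perturbations are i.i.d. Gaussian (white), a placeholder
    for the PSD-shaped stochastic process described in the text.
    """
    rng = np.random.default_rng(seed)
    periods = mean_period * (1.0 + rel_std * rng.standard_normal(n_cycles))
    periods = np.maximum(periods, 0.5 * mean_period)  # keep lengths positive
    lengths = np.round(periods * fs).astype(int)      # cycle lengths in samples
    signal = np.concatenate([rosenberg_pulse(n) for n in lengths])
    return signal, lengths / fs                       # realized periods in seconds

glottal_signal, realized_periods = jittered_glottal_signal(n_cycles=50)
```

Because each cycle is rounded to a whole number of samples, the realized periods are returned alongside the signal; any jitter measure should be computed from those values rather than from the unrounded draws.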

Let $T_{g}$ be the glottal time interval; that is, the time interval associated to a complete glottal cycle. In the

Results

The objective of this section is to show the results obtained with the generation of the voice signals with jitter using the stochastic models proposed. The voice signals are generated according to Eq. (1) and four cases of glottal pulse models are considered: (i) Rosenberg model and PSD ($S_{X}$) with two parameters; (ii) Rosenberg model and PSD ($S_{X}$) with three parameters; (iii) LF model and PSD ($S_{X}$) with two parameters; and (iv) LF model and PSD ($S_{X}$) with three parameters.

Conclusions

Voice signals with jitter were synthesized with the proposed stochastic models, covering normal voices, with a low level of jitter, and voice signals with high levels of jitter, which characterize hoarse voices or indicate some type of pathology, depending on the level of jitter.

The LF model is more sophisticated and the shape of the glottal pulse causes a smoother voice sound than the one generated with the Rosenberg model. The LF model associated with a power spectral density function

Declaration of Competing Interest

None.

Acknowledgments

This work was supported by the Brazilian agency CNPq (Conselho Nacional de Pesquisa e Desenvolvimento).

References (37)

M. Asiaee et al., Voice quality evaluation in patients with COVID-19: an acoustic analysis, J. Voice (2020)
P. Bangayan et al., Analysis by synthesis of pathological voices using the Klatt synthesizer, Speech Commun. (1997)
Y. Bennane et al., Synthesis of pathological voices and experiments on the effect of jitter and shimmer in voice quality perception, Proceedings of the 5th International Conference on Electrical Engineering - Boumerdes (ICEE-B) (2017)
E. Cataldo et al., Jitter generation in voice signals produced by a two-mass stochastic mechanical model, Biomed. Signal Process. Control (2016)
E. Cataldo et al., Stochastic mechanical model of vocal folds for producing jitter and for identifying pathologies through real voices, J. Biomech. (2018)
R. Chiaramonte et al., Acoustic analysis of voice in Parkinson's disease: a systematic review of voice disability and meta-analysis of studies, Rev. Neurol. (2020)
B. Doval et al., Spectral correlates of glottal waveform models: an analytic study, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Munich, Germany (1997)
G. Fant, The acoustic theory of speech production (1960)
G. Fant, Vocal source analysis - a progress report, STL-QPSR (1979)
G. Fant et al., A four parameter model of glottal flow, STL-QPSR, No. 4 (1985)
N. Henrich et al., Glottal Flow Models: Waveforms, Spectra and Physical Measurements (2002)
D.J. Higham, An algorithmic introduction to numerical simulation of stochastic differential equations, SIAM Rev. Soc. Ind. Appl. Math. (2001)
L.C. Klatt, Analysis, synthesis, and perception of voice quality variations among female and male talkers, J. Acoust. Soc. Am. (1990)
P. Krée et al., Mathematics of Random Phenomena (1986)
G. Li et al., Acoustic parameters for the evaluation of voice quality in patients with voice disorders, Ann. Palliat. Med. (2021)
L. Li et al., A pathological study of bamboo nodule of the vocal fold, J. Voice (2010)
S. Lotfi et al., Voice disorders identification using multilayer neural network, Int. Arab J. Inf. Technol. (2010)
J. Lucero et al., Phonation threshold pressure at large asymmetries of the vocal folds, Biomed. Signal Process. Control (2020)

