Elsevier

Computer Speech & Language

Volume 69, September 2021, 101225

Stochastic models of glottal pulses from the Rosenberg and Liljencrants-Fant models with unified parameters

https://doi.org/10.1016/j.csl.2021.101225

Highlights

  • Stochastic models proposed to generate jitter.

  • Stochastic models based on the Rosenberg model and the LF model unified.

  • The variation of the glottal time interval modelled as a stochastic process.

  • Generation of intelligible voiced sounds.

  • Generation of voice signals with jitter and, consequently, hoarse voices.

Abstract

In voice production, the random variation of the glottal cycles in relation to a mean value, caused by the (quasi-)periodic movement of the vocal folds, generates the jitter phenomenon. Its study is important because of applications such as the identification of voice-related pathologies, the improvement of the naturalness of synthesized voices, and the calibration of signal processing algorithms that identify glottal cycles. The objective of this work is to build stochastic models for glottal signals, using the Rosenberg and Liljencrants-Fant models with unified parameters, and considering the glottal time interval (the total time from the opening up to the closing of the vocal folds in each cycle) as a random variable, simulating the situation where the stiffness of the vocal folds is a stochastic process. Four different models are proposed, each generating the jitter phenomenon in simulated glottal signals. Intelligible voiced sounds are obtained both for normal voices, whose signals have a low level of jitter, and for hoarse voices, where the levels of jitter are higher. All the simulated sounds are available for listening.

Introduction

Human beings are capable of expressing their feelings, thoughts and wishes through the voice, which is an important means of communication for living in society. The generation of voice, in the case of voiced sounds, starts with the airflow coming from the lungs and forcing the vocal folds to oscillate. After the air passes through the glottis, where the vocal folds are located, air pulses are generated, forming the so-called glottal signal, a (quasi-)periodic signal that represents the glottal flow. This signal is then filtered and amplified by the vocal tract (the portion from the glottis up to the mouth) and finally radiated by the mouth, generating the sound.

The small random variations in the glottal cycles, due to the asymmetric movement of the vocal folds, constitute a random phenomenon called jitter. There are different measures of jitter; one of them, called local jitter, gives the relative percentage of the disturbances in relation to a mean value, and its typical value for normal voices is around 1% (Mongia and Sharma, 2014; Wong et al., 1991). It has been verified empirically that vocal jitter increases for some dysphonic voices (Pinto and Titze, 1990; Schoentgen and De Guchteneere, 1995).
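As a minimal sketch of how a local jitter value can be computed, the Python snippet below assumes one widely used definition: the mean absolute difference between consecutive cycle lengths, divided by the mean cycle length, expressed as a percentage. The function name `local_jitter_percent` and the cycle lengths are illustrative assumptions, not data or code from the cited studies.

```python
def local_jitter_percent(periods):
    """Mean absolute difference between consecutive cycle lengths,
    divided by the mean cycle length, as a percentage.
    (Assumed definition of local jitter, for illustration.)"""
    if len(periods) < 2:
        raise ValueError("need at least two glottal cycle lengths")
    diffs = [abs(b - a) for a, b in zip(periods, periods[1:])]
    mean_abs_diff = sum(diffs) / len(diffs)
    mean_period = sum(periods) / len(periods)
    return 100.0 * mean_abs_diff / mean_period


# Hypothetical cycle lengths in seconds (illustrative values, not measured data).
cycles = [0.0100, 0.0101, 0.0099, 0.0100, 0.0102]
jitter = local_jitter_percent(cycles)
print(f"local jitter = {jitter:.2f}%")
```

A perfectly periodic sequence of cycles gives a local jitter of zero under this definition, and larger cycle-to-cycle perturbations give larger values.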

Models of jitter can be used to improve the naturalness of synthesized voices (even if voices with pathological characteristics are considered), to simulate hoarse voices (Bangayan et al., 1997), to generate voice signals to calibrate signal processing algorithms, and to help in detecting glottal cycles (Muta et al., 1988). In addition, models of jitter could suggest or confirm the mathematical form of markers that would characterize perturbed cycle lengths statistically rather than heuristically.

Schoentgen (2001) presented stochastic models of jitter through the simulation of instantaneous frequency causing perturbations of the glottal cycles. Ruinskiy and Lavner (2008) presented an algorithm to transform a normal voice into a hoarse voice. Using spring-mass-damper mechanical models of the vocal folds, Cataldo and Soize (2016, 2018) considered a stochastic model for the stiffness of the spring to generate jitter.

In this paper, a voice modification algorithm for transforming a modal voice into a hoarse voice is presented. The algorithm is based on creating jitter, which is known to be a random phenomenon. Two deterministic mathematical models of the glottal signal are considered, the Rosenberg model and the Liljencrants-Fant model (LF model) with unified parameters (Henrich et al., 2002). Taking the time variation of the glottal intervals as a stochastic process, stochastic models of glottal signals are proposed that generate jitter. The second model, the LF model, is more sophisticated and is a better approximation of the real glottal signal. Then, through the Fant source-filter theory, voice signals with jitter are synthesized. Two different power spectral densities (PSDs) are associated with the proposed stochastic models and, consequently, four different models are created for the glottal signals. Simulations are performed and voiced sounds with different levels of jitter are obtained. A comparison between the models is performed.

It is important to say that, in general, the quality of the signal generated is related to the mathematical model of the pulse used (Rosenberg or LF), to the bandwidths of the formant frequencies and also to the level of jitter.

In real voices, jitter is always present. The naturalness of synthesized voices is therefore improved when the phenomenon is included, not only for normal voices but also for hoarse voices, which have a greater level of jitter and can also be associated with pathological cases.

With the model presented, it is possible to generate the random phenomenon that is present in all voice signals. By numerical simulation, a large dataset can be generated for different voice signals, with different levels of jitter, and for different kinds of pathologies. Such a dataset can be used for training an artificial neural network. Artificial neural networks have been used by some researchers to classify voices with pathological characteristics (Mohammed et al., 2020; Megala et al., 2019; Souissi and Cherif, 2016; Lotfi et al., 2010).

The proposed stochastic model makes it possible to obtain good speech synthesis with jitter. Although jitter is not a particularly robust clinical measure, it is one of the main parameters used, and this is the main reason for focusing on it. With this stochastic model, it will therefore be possible, in a second phase, to classify pathologies, also using other parameters, as presented in different works in the literature. Some of them are very recent (Bennane and Kacha, 2017; Schoentgen and Aichinger, 2019; Lucero et al., 2020; Asiaee et al., 2020; Wei et al., 2020; Chiaramonte and Bonfiglio, 2020; Li et al., 2021).

The proposed model shows a way to create jitter in a voice signal on a consistent mathematical basis, rather than by simply adding noise to the signal. Moreover, it provides an interesting and robust mathematical formulation to generate jitter. Different power spectral density functions can be tested and the corresponding inverse problem solved to identify parameters related to a specific voice or group of voices, including voices with pathological characteristics.

The ideas discussed here are based on earlier studies of stochastic models of jitter, including not only signal processing but also mechanical models to generate jitter. These models are based on the fact that the stiffness of the vocal folds is a stochastic process and, as a consequence, jitter is produced. In terms of signal processing, the idea is to reproduce the modelling of the stiffness as a stochastic process, but considering the glottal time interval as the stochastic process.

The synthesized sounds are available to the readers (the corresponding link will be made available later in the paper), so that the intelligibility of the sounds created can be verified. They correspond to normal voices and also to hoarse voices, which could characterize any type of pathology.

Section snippets

The source-filter theory

The complete model presented here is based on the Fant source-filter theory (Fant, 1960), illustrated in Fig. 1.

The voice signal generated, $s(t)$, is given by the convolution of the glottal signal $u_g(t)$, the impulse response function $h_v(t)$ of the filter that models the vocal tract, and the impulse response function $h_r(t)$ of the radiation by the mouth. Eq. (1) is then obtained:

$$s(t) = (h_r * (h_v * u_g))(t), \tag{1}$$

or, in the frequency domain, Eq. (2) is obtained:

$$\hat{s}(\omega) = \hat{h}_r(\omega)\,\hat{h}_v(\omega)\,\hat{u}_g(\omega), \tag{2}$$
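The convolution chain of Eq. (1) can be sketched in discrete time as follows. This is only an illustration under stated assumptions: the sequences `u_g`, `h_v` and `h_r` are toy values chosen here (a crude pulse, a short decaying response, and a first difference standing in for lip radiation), not the signals or filters used in the paper, and the sampling-period scale factor of a continuous-time convolution is ignored.

```python
def convolve(x, h):
    """Full discrete linear convolution of two finite sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


# Toy sequences (assumptions for illustration only).
u_g = [0.0, 0.5, 1.0, 0.5, 0.0]  # one crude glottal pulse
h_v = [1.0, 0.6, 0.36]           # short decaying "vocal tract" response
h_r = [1.0, -1.0]                # first difference as a lip-radiation stand-in

s = convolve(h_r, convolve(h_v, u_g))      # Eq. (1): h_r * (h_v * u_g)
s_alt = convolve(convolve(h_r, h_v), u_g)  # same result, by associativity
```

Because convolution is associative, the two filters can equally be combined first and then applied to the glottal signal. The sum of the output samples equals the product of the sums of the three sequences, which is the discrete counterpart of Eq. (2) evaluated at zero frequency.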

The jitter phenomenon

Jitter is a random phenomenon present in all human voices, caused by the small perturbation of the instantaneous fundamental frequency. It is an acoustic characteristic with important applications in studies related to the voice, such as the identification of pathologies, the identification of voice aging, voice recognition, speaker recognition, and others (Wilcox and Horii, 1980; Li et al., 2010; Mendonza et al., 2014). Different measures of jitter can be used.

Glottal signal stochastic models

For each deterministic model discussed (the Rosenberg model and the LF model), the time variation of the glottal time interval is considered a stochastic process. Two different power spectral densities are associated with each stochastic process, so four complete glottal pulse models are considered. They are described as follows.
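The general idea of a glottal signal whose cycle length varies randomly can be sketched as below. This is a simplified stand-in and not the authors' formulation: the cycle lengths follow a first-order autoregressive Gaussian sequence (an assumption made here in place of the paper's PSD-defined stochastic processes), and the pulse is one common trigonometric Rosenberg-type shape, which may differ from the unified-parameter version used in the paper. All function names, the phase fractions `open_frac` and `close_frac`, and the numerical values are hypothetical.

```python
import math
import random


def rosenberg_pulse(t, T, open_frac=0.4, close_frac=0.16):
    """A common trigonometric Rosenberg-type pulse over a cycle of length T."""
    tp = open_frac * T   # opening phase duration (assumed fraction)
    tn = close_frac * T  # closing phase duration (assumed fraction)
    if 0.0 <= t <= tp:
        return 0.5 * (1.0 - math.cos(math.pi * t / tp))
    if tp < t <= tp + tn:
        return math.cos(0.5 * math.pi * (t - tp) / tn)
    return 0.0           # closed phase


def random_cycle_lengths(n, mean_T, sigma, rho, rng):
    """Cycle lengths mean_T * (1 + e_k), with e_k a stationary AR(1)
    Gaussian sequence (stand-in for the paper's stochastic process)."""
    e = rng.gauss(0.0, sigma)
    lengths = []
    for _ in range(n):
        lengths.append(mean_T * (1.0 + e))
        e = rho * e + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, sigma)
    return lengths


def glottal_signal(lengths, fs):
    """Concatenate one pulse per cycle, sampled at fs Hz."""
    samples = []
    for T in lengths:
        n = max(1, round(T * fs))
        samples.extend(rosenberg_pulse(k / fs, T) for k in range(n))
    return samples


rng = random.Random(0)  # fixed seed for reproducibility
lengths = random_cycle_lengths(50, mean_T=0.01, sigma=0.01, rho=0.5, rng=rng)
u_g = glottal_signal(lengths, fs=16000)
```

Increasing `sigma` in this sketch increases the cycle-to-cycle variation and therefore the jitter of the resulting signal, while `rho` controls how strongly consecutive cycle lengths are correlated.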

Let $T_g$ be the glottal time interval; that is, the time interval associated with a complete glottal cycle. In the

Results

The objective of this section is to show the results obtained with the generation of the voice signals with jitter using the stochastic models proposed. The voice signals are generated according to Eq. (1) and four cases of glottal pulse models are considered: (i) Rosenberg model and PSD ($S_X$) with two parameters; (ii) Rosenberg model and PSD ($S_X$) with three parameters; (iii) LF model and PSD ($S_X$) with two parameters; and (iv) LF model and PSD ($S_X$) with three parameters.

Conclusions

Voice signals with jitter were synthesized with the proposed stochastic models, considering normal voices, with a low level of jitter, and voice signals with high levels of jitter, characterizing hoarse voices or indicating some type of pathology, depending on the level of jitter.

The LF model is more sophisticated and the shape of the glottal pulse causes a smoother voice sound than the one generated with the Rosenberg model. The LF model associated with a power spectral density function

Declaration of Competing Interest

None.

Acknowledgments

This work was supported by the Brazilian agency CNPq (Conselho Nacional de Pesquisa e Desenvolvimento).

References (37)

  • M. Asiaee et al. Voice quality evaluation in patients with COVID-19: an acoustic analysis. J. Voice (2020)

  • P. Bangayan et al. Analysis by synthesis of pathological voices using the Klatt synthesizer. Speech Commun. (1997)

  • Y. Bennane et al. Synthesis of pathological voices and experiments on the effect of jitter and shimmer in voice quality perception. Proceedings of the 5th International Conference on Electrical Engineering - Boumerdes (ICEE-B) (2017)

  • E. Cataldo et al. Jitter generation in voice signals produced by a two-mass stochastic mechanical model. Biomed. Signal Process. Control (2016)

  • E. Cataldo et al. Stochastic mechanical model of vocal folds for producing jitter and for identifying pathologies through real voices. J. Biomech. (2018)

  • R. Chiaramonte et al. Acoustic analysis of voice in Parkinson's disease: a systematic review of voice disability and meta-analysis of studies. Rev. Neurol. (2020)

  • B. Doval et al. Spectral correlates of glottal waveform models: an analytic study. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Munich, Germany (1997)

  • G. Fant. The acoustic theory of speech production (1960)

  • G. Fant. Vocal source analysis - a progress report. STL-QPSR (1979)

  • G. Fant et al. A four parameter model of glottal flow. STL-QPSR, No. 4 (1985)

  • N. Henrich et al. Glottal Flow Models: Waveforms, Spectra and Physical Measurements (2002)

  • D.J. Higham. An algorithmic introduction to numerical simulation of stochastic differential equations. SIAM Rev. Soc. Ind. Appl. Math. (2001)

  • L.C. Klatt. Analysis, synthesis, and perception of voice quality variations among female and male talkers. J. Acoust. Soc. Am. (1990)

  • P. Krée et al. Mathematics of Random Phenomena (1986)

  • G. Li et al. Acoustic parameters for the evaluation of voice quality in patients with voice disorders. Ann. Palliat. Med. (2021)

  • L. Li et al. A pathological study of bamboo nodule of the vocal fold. J. Voice (2010)

  • S. Lotfi et al. Voice disorders identification using multilayer neural network. Int. Arab J. Inf. Technol. (2010)

  • J. Lucero et al. Phonation threshold pressure at large asymmetries of the vocal folds. Biomed. Signal Process. Control (2020)