Abstract
This paper considers the acoustic variability of the voice signal in automatic speech recognition systems. Two varieties are distinguished: intra-speaker and inter-speaker speech variability. A probabilistic cluster model of minimal speech units in the Kullback–Leibler information metric is used to describe them mathematically and to compare their magnitudes. On this basis, theoretical estimates of the acoustic variability of the voice signal are obtained separately for each variety. The information security effect in systems tuned to the voice of an authorized user is described and characterized quantitatively. Intra-speaker variability is negligible compared with inter-speaker variability and therefore has no noticeable harmful effect on the effectiveness of automatic speech recognition. A computational experiment on two speech streams from two different speakers, implemented with the author’s software, is set up to confirm and extend the theoretical results. The experimental results show that in a number of cases the level of inter-speaker variability exceeds the inter-phonemic differences within a homogeneous speech flow. Therefore, in systems tuned to the speaker's voice, the effect of acoustic variability of the voice signal is not only clearly positive overall, in that it protects information from unauthorized access, but is also significant in probability-theoretic terms. The results are intended for the development of new automatic speech recognition systems, and the modernization of existing ones, designed to operate in standalone mode.
Notes
Here we mean the obvious circumstance that, in the general case, the reference signals x_{k,r} can have a dimension much larger than that of the speech frame x [12]; in this case, however, this is not essential.
The program website: https://sites.google.com/site/frompldcreators/produkty-1/phonemetraining .
Sound file “5. Sounds and letters” on the site [in Russian] https://mixmuz.ru/mp3/%D0%B3%D0%BB%D0%B0%D1%81%D0%BD%D1%8B%D0%B5%20%D0%B7%D0%B2%D1%83%D0%BA%D0%B8 .
Recommendations site P.1120: https://www.itu.int/ITU-T/recommendations/rec.aspx?id=13176 .
References
L. Rabiner, R. Schafer, Theory and Applications of Digital Speech Processing (Pearson, Boston, 2010). URI: https://www.amazon.com/Theory-Applications-Digital-Speech-Processing/dp/0136034284.
I. B. Tampel, "Automatic speech recognition – the main stages over last 50 years," Sci. Tech. J. Inf. Technol. Mech. Opt., v.100, n.6, p.957 (2015). DOI: https://doi.org/10.17586/2226-1494-2015-15-6-957-968.
D. Yu, L. Deng, Automatic Speech Recognition (Springer London, London, 2015). DOI: https://doi.org/10.1007/978-1-4471-5779-3.
A. Rogowski, "Industrially oriented voice control system," Robot. Comput. Manuf., v.28, n.3, p.303 (2012). DOI: https://doi.org/10.1016/j.rcim.2011.09.010.
M. Schuster, "Speech Recognition for Mobile Devices at Google," in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Springer, Berlin, Heidelberg, 2010). DOI: https://doi.org/10.1007/978-3-642-15246-7_3.
R. Rammohan, N. Dhanabalsamy, V. Dimov, F. J. Eidelman, "Smartphone Conversational Agents (Apple Siri, Google, Windows Cortana) and Questions about Allergy and Asthma Emergencies," J. Allergy Clin. Immunol., v.139, n.2, p.AB250 (2017). DOI: https://doi.org/10.1016/j.jaci.2016.12.804.
V. V. Savchenko, A. V. Savchenko, "Information-theoretic analysis of efficiency of the phonetic encoding–decoding method in automatic speech recognition," J. Commun. Technol. Electron., v.61, n.4, p.430 (2016). DOI: https://doi.org/10.1134/S1064226916040112.
R. A. Ustinov, "Specific features of modern voice protection systems," Bezop. Inf. Tehnol., v.24, n.4, p.71 (2017). DOI: https://doi.org/10.26583/bit.2017.4.08.
Z. Wu, Information Hiding in Speech Signal for Secure Communication (Elsevier, Amsterdam, 2015). DOI: https://doi.org/10.1016/C2013-0-19179-9.
S. M. Qaisar, N. Hainmad, R. Khan, R. Asfour, "A Speech to Machine Interface Based on Perceptual Linear Prediction and Classification," in 2019 Advances in Science and Engineering Technology International Conferences (ASET) (IEEE, Washington, 2019). DOI: https://doi.org/10.1109/ICASET.2019.8714304.
R. González Hautamäki, M. Sahidullah, V. Hautamäki, T. Kinnunen, "Acoustical and perceptual study of voice disguise by age modification in speaker verification," Speech Commun., v.95, p.1 (2017). DOI: https://doi.org/10.1016/j.specom.2017.10.002.
V. V. Savchenko, "Minimum of Information Divergence Criterion for Signals with Tuning to Speaker Voice in Automatic Speech Recognition," Radioelectron. Commun. Syst., v.63, n.1, p.42 (2020). DOI: https://doi.org/10.3103/S0735272720010045.
S. Heald, S. Klos, H. Nusbaum, "Understanding Speech in the Context of Variability," in Neurobiology of Language (Academic Press, Cambridge, MA, 2016). DOI: https://doi.org/10.1016/B978-0-12-407794-2.00017-1.
I. A. Sieber, G. A. Moroz, "Estimating the Acoustic Variation of s via Principal Component Analysis," NSU Vestnik. Ser. Linguist. Intercult. Commun., v.17, n.1, p.49 (2019). DOI: https://doi.org/10.25205/1818-7935-2019-17-1-49-64.
J. H. L. Hansen, H. Bořil, "On the issues of intra-speaker variability and realism in speech, speaker, and language recognition tasks," Speech Commun., v.101, p.94 (2018). DOI: https://doi.org/10.1016/j.specom.2018.05.004.
N. A. Krasheninnikova, "Main Factors Interfering with Recognition of Speech Commands," Simbirsk Sci. Bull., n.1, p.201 (2011).
V. V. Savchenko, L. V. Savchenko, "Method for Measuring the Intelligibility of Speech Signals in the Kullback–Leibler Information Metric," Meas. Tech., v.62, n.9, p.832 (2019). DOI: https://doi.org/10.1007/s11018-019-01702-1.
O. F. Krivnova, "Prosodic phrasing in spoken text: localization of breathing pauses," in Computational linguistics and intelligent technologies: based on the materials of the international conference (Dialog, Moscow, 2016). URI: http://www.dialog-21.ru/media/3404/krivnovaof.pdf.
V. V. Savchenko, "Itakura–Saito Divergence as an Element of the Information Theory of Speech Perception," J. Commun. Technol. Electron., v.64, n.6, p.590 (2019). DOI: https://doi.org/10.1134/S1064226919060093.
V. V. Savchenko, "Estimation of the Phonetic Speech Quality Using the Information Theoretic Approach," J. Commun. Technol. Electron., v.63, n.1, p.53 (2018). DOI: https://doi.org/10.1134/S1064226918010126.
S. Kullback, Information Theory and Statistics (Dover Publications, New York, 1997). URI: https://www.amazon.com/Information-Theory-Statistics-Dover-Mathematics/dp/0486696847.
V. V. Savchenko, "Criterion for Minimum of Mean Information Deviation for Distinguishing Random Signals with Similar Characteristics," Radioelectron. Commun. Syst., v.61, n.9, p.419 (2018). DOI: https://doi.org/10.3103/S0735272718090042.
V. V. Savchenko, A. V. Savchenko, "Criterion of Significance Level for Selection of Order of Spectral Estimation of Entropy Maximum," Radioelectron. Commun. Syst., v.62, n.5, p.223 (2019). DOI: https://doi.org/10.3103/S0735272719050042.
H. B. Dwight, Tables of Integrals and Other Mathematical Data (Macmillan, New York, 1961).
"Linear prediction," in Springer Handbook of Speech Processing (Springer Berlin Heidelberg, Berlin, Heidelberg, 2008). DOI: https://doi.org/10.1007/978-3-540-49127-9.
P. H. Müller, P. Neumann, R. Storm, Tafeln der mathematischen Statistik (VEB Fachbuchverlag, 1973), p.279. URI: http://doi.wiley.com/10.1002/bimj.19740160816.
Ethics declarations
ADDITIONAL INFORMATION
V. V. Savchenko
The author declares that he has no conflict of interest.
The initial version of this paper was published in Russian in the journal “Izvestiya Vysshikh Uchebnykh Zavedenii. Radioelektronika,” ISSN 2307-6011 (Online), ISSN 0021-3470 (Print), at http://radio.kpi.ua/article/view/S0021347020100039 with DOI: https://doi.org/10.20535/S0021347020100039
About this article
Cite this article
Savchenko, V.V. Acoustic Variability of Voice Signal as Factor of Information Security for Automatic Speech Recognition Systems with Tuning to User Voice. Radioelectron. Commun. Syst. 63, 532–542 (2020). https://doi.org/10.3103/S0735272720100039
DOI: https://doi.org/10.3103/S0735272720100039