
A novel method for Mandarin speech synthesis by inserting prosodic structure prediction into Tacotron2

  • Original Article
  • Published:
International Journal of Machine Learning and Cybernetics

Abstract

Speech synthesis, an artificial-intelligence technology that uses computers to imitate human speech, plays a crucial role in human–computer interaction because it automatically converts text into speech with satisfactory intelligibility and naturalness. Tacotron2 is the second-generation end-to-end English speech synthesis model developed by Google. As Mandarin becomes increasingly popular worldwide, the associated speech synthesis technologies have been applied in a variety of applications. To extend Tacotron2 to Mandarin, we propose in this paper a novel synthesis method that adds a Mandarin-to-PinYin module and a prosodic structure prediction model to Tacotron2. Subjective and objective evaluations of the synthesized results demonstrate that the added prosodic structure prediction model helps Tacotron2 synthesize more natural, human-like Mandarin speech.
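The front end described in the abstract can be pictured as a small pipeline: Mandarin characters are first converted to Pinyin, and the prosody model's predicted boundaries are then spliced into the sequence as explicit tokens before the result is fed to Tacotron2. The sketch below is purely illustrative and not the paper's actual implementation: the tiny character-to-Pinyin table, the boundary labels, and the `#1`/`#2`/`#3` token convention (prosodic word, prosodic phrase, and intonational phrase, as commonly used in Chinese TTS corpora) are all assumptions for the example.

```python
# Toy grapheme-to-Pinyin table; a real system would use a full lexicon
# with polyphone disambiguation (the paper's Mandarin-to-PinYin module).
PINYIN = {"你": "ni3", "好": "hao3", "世": "shi4", "界": "jie4"}

# Prosodic boundary tokens in a common Chinese TTS labeling convention:
# #1 prosodic word, #2 prosodic phrase, #3 intonational phrase.
BOUNDARY_TOKENS = {1: "#1", 2: "#2", 3: "#3"}

def build_tacotron2_input(text, boundaries):
    """Convert Mandarin text to a Pinyin token sequence, inserting a
    prosodic boundary token after each annotated character position.

    boundaries: dict mapping 0-based character index -> boundary level,
    standing in for the output of a prosodic structure prediction model.
    """
    tokens = []
    for i, ch in enumerate(text):
        tokens.append(PINYIN[ch])
        level = boundaries.get(i)
        if level is not None:
            tokens.append(BOUNDARY_TOKENS[level])
    return " ".join(tokens)

# Suppose the prosody model predicts a prosodic-word boundary after
# the second character of "你好世界":
print(build_tacotron2_input("你好世界", {1: 1}))
# -> ni3 hao3 #1 shi4 jie4
```

Because the boundary tokens appear in the input sequence itself, Tacotron2's encoder can learn to associate them with pauses and pitch resets, which is the intuition behind inserting prosodic structure prediction into the synthesis pipeline.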


Notes

  1. https://www.data-baker.com/hc_znv_1.html.


Acknowledgements

This work was supported by the high-performance computing platform of Xi'an Jiaotong University, which provided convenient and efficient computing resources. This research was also supported by the National Key Research and Development Program of China (No. 2018AAA0102201) and the National Natural Science Foundation of China (No. 61877049).

Author information


Corresponding author

Correspondence to Chunxia Zhang.


About this article


Cite this article

Liu, J., Xie, Z., Zhang, C. et al. A novel method for Mandarin speech synthesis by inserting prosodic structure prediction into Tacotron2. Int. J. Mach. Learn. & Cyber. 12, 2809–2823 (2021). https://doi.org/10.1007/s13042-021-01365-x
