
A novel method for Mandarin speech synthesis by inserting prosodic structure prediction into Tacotron2

  • Original Article
  • Published:
International Journal of Machine Learning and Cybernetics

Abstract

Speech synthesis, an artificial-intelligence technology that uses computers to imitate human speech, plays a crucial role in human–computer interaction because it automatically converts text into speech with satisfactory intelligibility and naturalness. Tacotron2 is the second-generation end-to-end English speech synthesis model developed by Google. As Mandarin becomes increasingly popular worldwide, the associated speech synthesis technologies have been applied in a variety of applications. To extend Tacotron2 to Mandarin, we propose in this paper a novel synthesis method that adds a Mandarin-to-PinYin module and a prosodic structure prediction model to Tacotron2. Subjective and objective evaluations of the synthesized results demonstrate that the added prosodic structure prediction model helps Tacotron2 synthesize more natural, human-like Mandarin speech.
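The front end described in the abstract can be pictured as a small pipeline: Mandarin characters are first converted to Pinyin, and the prosody model's predicted boundaries are then spliced into the sequence as explicit tokens before the result is fed to Tacotron2. The sketch below is purely illustrative and not the paper's actual implementation: the tiny character-to-Pinyin table, the boundary labels, and the `#1`/`#2`/`#3` token convention (prosodic word, prosodic phrase, and intonational phrase, as commonly used in Chinese TTS corpora) are all assumptions for the example.

```python
# Toy grapheme-to-Pinyin table; a real system would use a full lexicon
# with polyphone disambiguation (the paper's Mandarin-to-PinYin module).
PINYIN = {"你": "ni3", "好": "hao3", "世": "shi4", "界": "jie4"}

# Prosodic boundary tokens in a common Chinese TTS labeling convention:
# #1 prosodic word, #2 prosodic phrase, #3 intonational phrase.
BOUNDARY_TOKENS = {1: "#1", 2: "#2", 3: "#3"}

def build_tacotron2_input(text, boundaries):
    """Convert Mandarin text to a Pinyin token sequence, inserting a
    prosodic boundary token after each annotated character position.

    boundaries: dict mapping 0-based character index -> boundary level,
    standing in for the output of a prosodic structure prediction model.
    """
    tokens = []
    for i, ch in enumerate(text):
        tokens.append(PINYIN[ch])
        level = boundaries.get(i)
        if level is not None:
            tokens.append(BOUNDARY_TOKENS[level])
    return " ".join(tokens)

# Suppose the prosody model predicts a prosodic-word boundary after
# the second character of "你好世界":
print(build_tacotron2_input("你好世界", {1: 1}))
# -> ni3 hao3 #1 shi4 jie4
```

Because the boundary tokens appear in the input sequence itself, Tacotron2's encoder can learn to associate them with pauses and pitch resets, which is the intuition behind inserting prosodic structure prediction into the synthesis pipeline.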


Notes

  1. https://www.data-baker.com/hc_znv_1.html.


Acknowledgements

This work was supported by the high-performance computing platform of Xi'an Jiaotong University, which provided convenient and efficient computing resources. This research was also supported by the National Key Research and Development Program of China (No. 2018AAA0102201) and the National Natural Science Foundation of China (No. 61877049).

Author information


Corresponding author

Correspondence to Chunxia Zhang.


About this article


Cite this article

Liu, J., Xie, Z., Zhang, C. et al. A novel method for Mandarin speech synthesis by inserting prosodic structure prediction into Tacotron2. Int. J. Mach. Learn. & Cyber. 12, 2809–2823 (2021). https://doi.org/10.1007/s13042-021-01365-x
