Investigation of learning abilities on linguistic features in sequence-to-sequence text-to-speech synthesis
Computer Speech & Language ( IF 4.3 ) Pub Date : 2020-12-15 , DOI: 10.1016/j.csl.2020.101183
Yusuke Yasuda , Xin Wang , Junichi Yamagishi

Neural sequence-to-sequence text-to-speech synthesis (TTS) can produce high-quality speech directly from text or from simple linguistic features such as phonemes. Unlike traditional pipeline TTS, neural sequence-to-sequence TTS does not require manually annotated, complicated linguistic features such as part-of-speech tags and syntactic structures for system training. However, it must be carefully designed and well optimized so that it can implicitly extract useful linguistic features from its input.

In this paper, we investigate under what conditions neural sequence-to-sequence TTS works well in Japanese and English, and compare it with deep neural network (DNN) based pipeline TTS systems. Unlike past comparative studies, the pipeline systems here also use neural autoregressive (AR) probabilistic modeling and a neural vocoder, in the same way as the sequence-to-sequence systems, to enable a fair and deep analysis. We investigated the systems from three aspects: a) model architecture, b) model parameter size, and c) language. For model architecture, we adopt modified Tacotron systems that we previously proposed, along with variants using an encoder from Tacotron or Tacotron2. For model parameter size, we investigate two parameter sizes. For language, we conduct listening tests in both Japanese and English to see whether our findings generalize across languages.
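The central contrast above is between input representations: raw characters, from which the encoder must learn pronunciation implicitly, versus phonemes (plus accentual-type labels for Japanese), where pronunciation is given explicitly. A minimal sketch of how these two front-end representations might be mapped to integer ID sequences for a sequence-to-sequence encoder is shown below; the symbol inventories and function names are hypothetical, not the authors' code.

```python
# Illustrative sketch (hypothetical symbol inventories, not the paper's code):
# the two input representations compared in the study, each encoded as an
# integer ID sequence that a sequence-to-sequence TTS encoder would embed.

# Character-based input: the encoder must learn pronunciation and
# supra-segmental features (stress, accent) implicitly from text.
CHAR_VOCAB = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz '")}

def encode_characters(text):
    """Map raw text to character IDs (symbols outside the vocabulary are skipped)."""
    return [CHAR_VOCAB[c] for c in text.lower() if c in CHAR_VOCAB]

# Phoneme-based input: pronunciation is given explicitly; for Japanese,
# accentual-type labels can be interleaved as extra symbols.
PHONE_VOCAB = {p: i for i, p in enumerate(
    ["pau", "a", "i", "u", "e", "o", "k", "s", "t", "n", "h", "m",
     "y", "r", "w", "g", "z", "d", "b", "p", "N",
     "^", "_"])}  # "^"/"_": hypothetical accent rise/fall markers

def encode_phonemes(phones):
    """Map a phoneme/accent-label sequence to IDs."""
    return [PHONE_VOCAB[p] for p in phones]

char_ids = encode_characters("hello world")
phone_ids = encode_phonemes(["k", "o", "N", "^", "n", "i", "t", "i", "w", "a"])
```

Under this framing, the English experiments ask whether the encoder can recover pronunciation from `char_ids` alone, while the Japanese experiments supply the explicit accent symbols directly in `phone_ids`.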

Our experiments on Japanese demonstrated that the Tacotron TTS systems with an increased parameter size, taking phonemes and accentual-type labels as input, outperformed the DNN-based pipeline systems that used the complicated linguistic features, and that the Tacotron encoder could learn to compensate for the lack of rich linguistic features. Our experiments on English demonstrated that, with a suitable encoder, the Tacotron TTS system taking characters as input can disambiguate pronunciations and produce speech as natural as that of the systems using phonemes. However, we also found that the encoder could not perfectly learn English stressed syllables from characters, which resulted in flatter fundamental frequency contours. In summary, these experimental results suggest that a) a neural sequence-to-sequence TTS system should have a sufficient number of model parameters to produce high-quality speech, b) it should use a powerful encoder when it takes characters as input, and c) the encoder still has room for improvement and needs an improved architecture to learn supra-segmental features more appropriately.




Updated: 2020-12-22