Model architectures to extrapolate emotional expressions in DNN-based text-to-speech
Introduction
This paper proposes architectures that facilitate the extrapolation of emotional expressions in deep neural network (DNN)-based text-to-speech (TTS). TTS is a technology that generates speech from text, and a variety of TTS methods have been proposed to generate natural, intelligible, and human-like speech. Recently, DNN-based TTS has been intensively investigated, and the results demonstrate that it can outperform hidden Markov model (HMM)-based TTS in the quality and naturalness of synthesized speech (Zen et al., 2013, Qian et al., 2014, Wu et al., 2015b, Watts et al., 2016). First, a feed-forward neural network (FFNN) was proposed as a replacement for the decision-tree approaches of HMM-based TTS (Zen et al., 2013). Subsequently, long short-term memory (LSTM)-based recurrent neural networks (RNNs) were adopted and provided better naturalness and prosody because of their capability to model the long-term dependencies of speech (Fan et al., 2014).
In addition to quality and naturalness, DNN-based TTS has advantages in its capability to control voice aspects. For example, to control speaker identity, several multi-speaker models have been proposed (Fan et al., 2015, Wu et al., 2015a, Hojo et al., 2018). To control speaker characteristics, a method using auxiliary vectors encoding the voice's gender, age, and identity was proposed (Luong et al., 2017). Additionally, a multi-language and multi-speaker model was built by sharing data across languages and speakers (Li and Zen, 2016).
Among voice aspects, emotional expression is one of the most important, and DNN-based TTS has also been extended to synthesize speech with emotions. For example, in the simplest approach, the use of an emotional one-hot vector was proposed by An et al. (2017). To control emotional strength, a method using an auxiliary vector based on listener perception was proposed by Jaime et al. (2018). Using a speaker adaptation method, Yang et al. (2018) proposed a method that generates emotional speech from a small amount of emotional speech training data. In all the aforementioned methods, the target speaker's emotional speech is necessary for training. In general, however, it is difficult for individuals to utter speech with a specific emotion and to keep speaking with that emotion for a few hours. Consequently, recording the target speaker's emotional speech is a bottleneck in constructing DNN-based TTS that can synthesize emotional speech with a particular speaker's voice quality.
To overcome this problem, one possible approach is extrapolation: emotional expression models are trained using speech uttered by a particular person, and the models are applied to another person to generate emotional speech with that person's voice quality. In other words, a collection of emotional speech uttered by the target speaker is not required, and emotional expression is generated using models trained on the emotional speech of another person; in short, extrapolation means borrowing emotional models from another individual. Based on this approach, several methods have been proposed in HMM-based TTS that can generate emotional speech under extrapolation conditions. Kanagawa et al. (2013) suggested generating speaker-independent transformation matrices using pairs of neutral and target-style speech, and applying these matrices to a neutral-style model of a new speaker. Similarly, Jaime et al. (2013) proposed extrapolating the expressiveness of proven speaking-style models from speakers who utter speech in a neutral speaking style, using a constrained structural maximum a posteriori linear regression (CSMAPLR) algorithm (Yamagishi et al., 2009). Ohtani et al. (2015) proposed an emotion additive model to extrapolate emotional expression onto a neutral voice. All the aforementioned methods suggest that the extrapolation of emotional expressions is possible by separately modeling the emotional expressions and the speaker identities.
Based on the extrapolation approach, we propose a novel DNN-based TTS that can synthesize emotional speech. The biggest advantage of the proposed algorithm is that we can synthesize several types of emotional speech with the voice qualities of multiple speakers, even if the target speaker's emotional speech is not included in the training data. The key idea is to explicitly control the speaker factor and the emotional factor, motivated by the success of multi-speaker models (Fan et al., 2015, Wu et al., 2015a, Hojo et al., 2018, Luong et al., 2017, Li and Zen, 2016) and multi-emotional models (An et al., 2017, Jaime et al., 2018, Yang et al., 2018). Once the factors are trained, we can synthesize speech with any combination of a speaker and an emotion by controlling the factors independently. As training data, we have emotional speech, including a neutral speaking style, uttered by a few speakers, and only neutral speech uttered by many speakers. The speaker factor must therefore be trained using the neutral speech uttered by each speaker, and the emotional factor must be trained using the speech uttered by the few speakers. To this end, we examine five types of DNN architectures: the parallel model (PM), the serial model (SM), the auxiliary input model (AIM), and the hybrid models (PM&AIM and SM&AIM). The PM deals with the emotional and speaker factors in parallel on the output layer, whereas the SM deals with the two factors in serial order on the last hidden layer and the output layer. The AIM deals with the two factors by using auxiliary one-hot vectors. Unlike these simple models, the hybrid models combine two of them: PM and AIM, or SM and AIM. In Inoue et al. (2017), we reported the extrapolation of emotional expressions in acoustic feature modeling and evaluated the performance of synthesized speech uttered by only female speakers.
In this paper, we additionally investigate the extrapolation of emotional expressions in phoneme duration modeling and evaluate the performance of synthesized speech uttered by both male and female speakers.
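The AIM, the simplest of the five architectures, conditions the network on the two factors by appending one-hot codes to the linguistic features. The sketch below illustrates this input construction in NumPy; the speaker and emotion counts and the feature dimension are illustrative assumptions, not the paper's corpus configuration:

```python
import numpy as np

N_SPEAKERS, N_EMOTIONS = 4, 3  # illustrative sizes, not the actual corpus

def aim_input(linguistic, speaker_id, emotion_id):
    """Auxiliary input model (AIM) sketch: the speaker and emotion factors
    are represented as one-hot vectors appended to the linguistic features,
    so any (speaker, emotion) pair can be requested at synthesis time,
    including combinations unseen during training."""
    spk = np.eye(N_SPEAKERS)[speaker_id]
    emo = np.eye(N_EMOTIONS)[emotion_id]
    return np.concatenate([linguistic, spk, emo])

# Request an extrapolated combination: speaker 2 with emotion 1.
x = aim_input(np.zeros(10), speaker_id=2, emotion_id=1)
print(x.shape)  # (17,)
```

Because the conditioning vectors are inputs rather than dedicated network branches, the AIM shares all weights across speakers and emotions; the PM and SM instead separate the factors structurally at the output and last hidden layers.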
This paper is organized as follows. In Section 2, we provide an overview of DNN-based TTS and introduce expansions to control multiple voice aspects. In Section 3, we describe the proposed DNN architectures. In Section 4, we explain the objective and subjective evaluations. In Section 5, we present our conclusions and suggestions for further research.
Section snippets
DNN-based TTS
DNN-based TTS is a method of speech synthesis that uses a DNN to map linguistic features to acoustic features. A DNN-based TTS system comprises text analysis, a phoneme duration model, an acoustic model, and waveform synthesis.
The simplest DNN, which generates an output vector from an input vector, is expressed by the recursive formula h_l = f(W_l h_{l-1} + b_l), where h_l is the D_l-dimensional output vector of the l-th layer, and W_l is the
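The layer recursion above can be sketched in a few lines of NumPy. The tanh hidden activation, the linear output layer, and the layer sizes are illustrative assumptions rather than the paper's exact configuration:

```python
import numpy as np

def forward(x, weights, biases):
    """Feed-forward pass h_l = f(W_l h_{l-1} + b_l), with h_0 = x.
    The final layer is linear, as is common when regressing continuous
    acoustic features (an assumption, not the paper's stated setup)."""
    h = x
    for l, (W, b) in enumerate(zip(weights, biases)):
        a = W @ h + b
        h = a if l == len(weights) - 1 else np.tanh(a)
    return h

rng = np.random.default_rng(0)
dims = [10, 32, 32, 5]  # linguistic-feature dim -> hidden -> acoustic-feature dim
weights = [rng.standard_normal((dims[i + 1], dims[i])) * 0.1 for i in range(3)]
biases = [np.zeros(dims[i + 1]) for i in range(3)]
y = forward(rng.standard_normal(10), weights, biases)
print(y.shape)  # (5,)
```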
Overview of the proposed method
Fig. 1 presents the proposed DNN-based TTS that generates emotional speech by combining the emotional factor and the speaker factor. In the training step presented in Fig. 1(a), multi-speaker and multi-emotional speech data are used, where speakers and types of emotions are unbalanced. That is, many speakers utter only neutral speech and few speakers utter both neutral and emotional speech. To synthesize emotional speech with the voice quality of the speakers who only utter neutral speech, DNNs
Evaluation experiments
To evaluate the extrapolation performance of the proposed architectures, open-emotion and closed-emotion tests are conducted both objectively and subjectively. Here, open-emotion and closed-emotion mean that the emotional speech of the target speaker is excluded from and included in the training data, respectively.
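The open-emotion condition amounts to holding out the target speaker's emotional utterances from training. The sketch below makes this data split concrete; the speaker labels and emotion inventory are hypothetical, not the actual corpus:

```python
# Sketch of the open-emotion condition: the target speaker's emotional
# utterances never appear in training, only in the test set.
corpus = [
    ("spk_A", "neutral"), ("spk_A", "happy"), ("spk_A", "sad"),
    ("spk_B", "neutral"), ("spk_B", "happy"), ("spk_B", "sad"),
    ("spk_T", "neutral"), ("spk_T", "happy"),  # spk_T is the target speaker
]
target = "spk_T"

# Training keeps all utterances except the target's emotional ones.
train = [u for u in corpus if u[0] != target or u[1] == "neutral"]
# Open-emotion test: the target's held-out emotional utterances.
test_open = [u for u in corpus if u[0] == target and u[1] != "neutral"]
```

In the closed-emotion condition, by contrast, `train` would simply be the whole corpus, so the model has seen the target speaker's emotional speech.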
Conclusion and future work
In this paper, to generate emotional expressions using DNN-based TTS, we proposed the following five models: PM, SM, AIM, PM&AIM, and SM&AIM. These models are based on the following extrapolation approach: emotional expression models are trained using speech uttered by a particular person, and the models are applied to another person for generating emotional speech with the person’s voice quality. In other words, the collection of emotional speech uttered by a target speaker is unnecessary, and
CRediT authorship contribution statement
Katsuki Inoue: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Data curation, Visualization, Writing - original draft, Project administration. Sunao Hara: Resources, Writing - review & editing. Masanobu Abe: Conceptualization, Resources, Writing - review & editing, Supervision, Funding acquisition. Nobukatsu Hojo: Software, Resources, Writing - review & editing. Yusuke Ijima: Conceptualization, Methodology, Software, Resources, Writing - review & editing,
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References (28)
- Kawahara, H., et al., 1999. Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of repetitive structure in sounds. Speech Commun.
- An, Shumin, Ling, Zhenhua, Dai, Lirong, 2017. Emotional statistical parametric speech synthesis using LSTM-RNNs. In: ...
- Caruana, R., 1997. Multitask learning. Mach. Learn.
- Dehak, N., et al., 2010. Front-end factor analysis for speaker verification. IEEE Trans. Audio Speech Language Process.
- Fan, Yuchen, Qian, Yao, Soong, Frank K., He, Lei, 2015. Multi-speaker modeling and speaker adaptation for DNN-based TTS ...
- Fan, Yuchen, Qian, Yao, Xie, Feng-Long, Soong, Frank K., 2014. TTS synthesis with bidirectional LSTM based recurrent ...
- Hojo, N., et al., 2018. DNN-based speech synthesis using speaker codes. IEICE Trans. Inform. Syst.
- Inoue, K., et al. An investigation to transplant emotional expressions in DNN-based TTS synthesis.
- Jaime, Lorenzo-Trueba, et al., 2018. Investigating different representations for modeling and controlling multiple emotions in DNN-based speech synthesis. Speech Commun.
- Jaime, Lorenzo-Trueba, Roberto, Barra-Chicote, Watts, Oliver, Montero, Juan Manuel, 2013. Towards speaking style ...