Speech Communication

Volume 126, February 2021, Pages 35-43

Model architectures to extrapolate emotional expressions in DNN-based text-to-speech

https://doi.org/10.1016/j.specom.2020.11.004

Highlights

  • We propose model architectures to synthesize emotional speech by extrapolation.

  • The target speaker borrows emotional expressions from the data of other speakers.

  • The neural networks are trained with multi-speaker and multi-emotional speech.

  • The speech dataset includes many speakers who utter only neutral speech.

  • The model architecture with an expanded output layer showed the best performance.

Abstract

This paper proposes architectures that facilitate the extrapolation of emotional expressions in deep neural network (DNN)-based text-to-speech (TTS). In this study, “extrapolating emotional expressions” means borrowing emotional expressions from other speakers, so that collecting emotional speech uttered by the target speaker is unnecessary. Although DNNs have the potential to construct TTS with emotional expressions, and some DNN-based TTS systems have demonstrated satisfactory performance in expressing the diversity of human speech, collecting emotional speech uttered by target speakers is necessary but troublesome. To solve this issue, we propose architectures that train the speaker feature and the emotional feature separately and synthesize speech with any combination of speaker and emotion. The architectures are the parallel model (PM), the serial model (SM), the auxiliary input model (AIM), and the hybrid models (PM&AIM and SM&AIM). These models are trained with emotional speech uttered by a few speakers and neutral speech uttered by many speakers. Objective evaluations in the open-emotion test provide only limited information, because they compare against the closed-emotion references even though each speaker has their own manner of expressing emotion. However, subjective evaluation results indicate that the proposed models can convey emotional information to some extent. Notably, the PM can correctly convey sad and joyful emotions at a rate of >60%.

Introduction

This paper proposes architectures that facilitate the extrapolation of emotional expressions in deep neural network (DNN)-based text-to-speech (TTS). TTS is a technology that generates speech from text, and a variety of TTS methods have been proposed to generate natural, intelligible, and human-like speech. Recently, DNN-based TTS has been intensively investigated, and the results demonstrate that DNN-based TTS can outperform hidden Markov model (HMM)-based TTS in the quality and naturalness of synthesized speech (Zen et al., 2013, Qian et al., 2014, Wu et al., 2015b, Watts et al., 2016). First, a feed-forward neural network (FFNN) was proposed as a replacement for the decision-tree approach in HMM-based TTS (Zen et al., 2013). Subsequently, long short-term memory (LSTM)-based recurrent neural networks (RNNs) have been adopted and provide better naturalness and prosody because of their capability to model the long-term dependencies of speech (Fan et al., 2014).

In addition to quality and naturalness, DNN-based TTS has an advantage in its capability to control voice aspects. For example, to control speaker identity, several multi-speaker models have been proposed (Fan et al., 2015, Wu et al., 2015a, Hojo et al., 2018). To control speaker characteristics, a method using auxiliary vectors representing the voice’s gender, age, and identity was proposed (Luong et al., 2017). Additionally, a multi-language and multi-speaker model was built by sharing data across languages and speakers (Li and Zen, 2016).

Among voice aspects, emotional expression is one of the most important. DNN-based TTS has also been extended to synthesize speech with emotions. For example, in the simplest approach, the use of an emotional one-hot vector was proposed by An et al. (2017). To control emotional strength, a method using an auxiliary vector based on listener perception was proposed by Jaime et al. (2018). Using a speaker adaptation method, Yang et al. (2018) proposed a method that generates emotional speech from a small amount of emotional speech training data. In all of the aforementioned methods, the target speaker’s emotional speech is necessary for training. However, it is generally difficult for a person to utter speech with a specific emotion and to keep speaking with that emotion for a few hours. Consequently, recording the target speaker’s emotional speech is a bottleneck in constructing DNN-based TTS that can synthesize emotional speech with a particular speaker’s voice quality.

To overcome this problem, one possible approach is extrapolation. Emotional expression models are trained using speech uttered by a particular person, and the models are applied to another person to generate emotional speech with that person’s voice quality. In other words, a collection of emotional speech uttered by the target speaker is not required; emotional expressions are generated using models trained on the emotional speech of another person. In short, extrapolation here means borrowing emotional models from another individual. Based on this approach, several methods that can generate emotional speech under extrapolation conditions have been proposed for HMM-based TTS. Kanagawa et al. (2013) suggested generating speaker-independent transformation matrices using pairs of neutral and target-style speech and applying these matrices to a neutral-style model of a new speaker. Similarly, Jaime et al. (2013) proposed extrapolating the expressiveness of proven speaking-style models to speakers who utter speech only in a neutral speaking style, using the constrained structural maximum a posteriori linear regression (CSMAPLR) algorithm (Yamagishi et al., 2009). Ohtani et al. (2015) proposed an emotion additive model to extrapolate emotional expressions to a neutral voice. All of the aforementioned methods suggest that the extrapolation of emotional expressions is possible by modeling the emotional expressions and the speaker identities separately.

Based on the extrapolation approach, we propose a novel DNN-based TTS that can synthesize emotional speech. The biggest advantage of the proposed method is that it can synthesize several types of emotional speech with the voice qualities of multiple speakers, even if the target speaker’s emotional speech is not included in the training data. The key idea is to explicitly control the speaker factor and the emotional factor, motivated by the success of multi-speaker models (Fan et al., 2015, Wu et al., 2015a, Hojo et al., 2018, Luong et al., 2017, Li and Zen, 2016) and multi-emotional models (An et al., 2017, Jaime et al., 2018, Yang et al., 2018). Once the factors are trained, by controlling them independently we can synthesize speech with any combination of a speaker and an emotion. As training data, we have emotional speech, including a neutral speaking style, uttered by a few speakers, and only neutral speech uttered by many speakers. The speaker factor must be trained using the neutral speech uttered by each speaker, and the emotional factor must be trained using the speech uttered by the few emotional speakers. To achieve this, we examine five types of DNN architectures: the parallel model (PM), the serial model (SM), the auxiliary input model (AIM), and the hybrid models (PM&AIM and SM&AIM). The PM deals with the emotional factor and the speaker factor in parallel on the output layer; the SM deals with the two factors in serial order on the last hidden layer and the output layer; and the AIM deals with the two factors by using auxiliary one-hot input vectors. Differing from these simple models, the hybrid models combine two of them: PM and AIM, or SM and AIM. In Inoue et al. (2017), we reported the extrapolation of emotional expressions in acoustic feature modeling and evaluated the performance of synthesized speech uttered only by female speakers. In this paper, we additionally investigate the extrapolation of emotional expressions in phoneme duration modeling and evaluate the performance of synthesized speech uttered by both male and female speakers.
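To make the role of the auxiliary one-hot vectors concrete, the following minimal sketch shows how an AIM-style network input could be assembled by concatenating linguistic features with speaker and emotion codes; the feature dimensions, speaker/emotion indices, and function names are illustrative assumptions, not the configuration used in this paper.

```python
import numpy as np

def one_hot(index, size):
    """Return a one-hot column vector of the given size."""
    v = np.zeros((size, 1))
    v[index, 0] = 1.0
    return v

def aim_input(linguistic, speaker_id, emotion_id, n_speakers, n_emotions):
    """Concatenate linguistic features with speaker and emotion one-hot codes,
    so that one shared network can be conditioned on any speaker/emotion pair."""
    return np.vstack([linguistic,
                      one_hot(speaker_id, n_speakers),
                      one_hot(emotion_id, n_emotions)])

# Extrapolation at synthesis time: a speaker seen only with neutral speech
# during training (id 7) is paired with an emotion learned from other speakers
# (e.g. "joy" = id 2). All sizes below are assumed for illustration.
linguistic = np.random.randn(300, 1)   # assumed linguistic feature dimension
x = aim_input(linguistic, speaker_id=7, emotion_id=2, n_speakers=10, n_emotions=4)
print(x.shape)                          # (314, 1)
```

Because the speaker code and the emotion code are set independently, a speaker observed only with neutral speech during training can be combined with an emotion learned from other speakers, which is exactly the extrapolation setting considered here.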

This paper is organized as follows. In Section 2, we provide an overview of DNN-based TTS and introduce expansions to control multiple voice aspects. In Section 3, we describe the proposed DNN architectures. In Section 4, we explain the objective and subjective evaluations. In Section 5, we present our conclusions and suggestions for further research.

Section snippets

DNN-based TTS

DNN-based TTS is a method of speech synthesis that uses a DNN to map linguistic features to acoustic features. A DNN-based TTS system comprises text analysis, a phoneme duration model, an acoustic model, and waveform synthesis.
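As a schematic of this pipeline, the sketch below chains the four components as plain Python functions; every function body is a stub with hypothetical behavior, intended only to show how the components connect, not how they are implemented in the paper.

```python
from typing import List

def text_analysis(text: str) -> List[dict]:
    """Convert text into per-phoneme linguistic features (stub)."""
    return [{"phoneme": ph} for ph in text.split()]

def duration_model(linguistic: List[dict]) -> List[int]:
    """Predict the number of acoustic frames per phoneme (stub: 10 frames each)."""
    return [10 for _ in linguistic]

def acoustic_model(linguistic: List[dict], durations: List[int]) -> List[list]:
    """Map frame-level linguistic features to acoustic features (stub: 60-dim zeros)."""
    return [[0.0] * 60 for _, d in zip(linguistic, durations) for _ in range(d)]

def waveform_synthesis(acoustic: List[list]) -> bytes:
    """Generate a waveform from acoustic features with a vocoder (stub)."""
    return bytes(len(acoustic))

def tts(text: str) -> bytes:
    linguistic = text_analysis(text)
    durations = duration_model(linguistic)
    acoustic = acoustic_model(linguistic, durations)
    return waveform_synthesis(acoustic)

print(len(tts("k o n n i ch i w a")))   # number of (dummy) output samples
```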

The simplest DNN, which generates an output vector $\mathbf{y}$ from an input vector $\mathbf{x}$, is expressed by the following recursive formula:
$$\mathbf{h}^{(\ell)} = f^{(\ell)}\left(\mathbf{W}^{(\ell)}\mathbf{h}^{(\ell-1)} + \mathbf{b}^{(\ell)}\right), \quad 1 \le \ell \le L, \quad \mathbf{h}^{(0)} = \mathbf{x}, \quad \mathbf{h}^{(L)} = \mathbf{y},$$
where $\mathbf{h}^{(\ell-1)} \in \mathbb{R}^{d_{\ell-1} \times 1}$ is the $d_{\ell-1}$-dimensional output vector of the $(\ell-1)$-th layer and $\mathbf{h}^{(\ell)} \in \mathbb{R}^{d_{\ell} \times 1}$ is the $d_{\ell}$-dimensional output vector of the $\ell$-th layer.
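For readers who prefer code, the recursion above corresponds to the following minimal NumPy sketch; the layer sizes, activation functions, and random weights are illustrative assumptions rather than the network configuration used in this work.

```python
import numpy as np

def feed_forward(x, weights, biases, activations):
    """Compute y = h^(L) from x = h^(0) via h^(l) = f^(l)(W^(l) h^(l-1) + b^(l))."""
    h = x
    for W, b, f in zip(weights, biases, activations):
        h = f(W @ h + b)
    return h

# Illustrative network with two hidden layers: 10-dim input, 5-dim output (assumed sizes).
rng = np.random.default_rng(0)
dims = [10, 16, 16, 5]
weights = [rng.standard_normal((dims[l + 1], dims[l])) * 0.1 for l in range(3)]
biases = [np.zeros((dims[l + 1], 1)) for l in range(3)]
activations = [np.tanh, np.tanh, lambda a: a]   # linear output layer

x = rng.standard_normal((10, 1))                # column vector, as in the text
y = feed_forward(x, weights, biases, activations)
print(y.shape)                                  # (5, 1)
```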

Overview of the proposed method

Fig. 1 presents the proposed DNN-based TTS that generates emotional speech by combining the emotional factor and the speaker factor. In the training step presented in Fig. 1(a), multi-speaker and multi-emotional speech data are used, where the speakers and types of emotions are unbalanced. That is, many speakers utter only neutral speech and a few speakers utter both neutral and emotional speech. To synthesize emotional speech with the voice quality of the speakers who utter only neutral speech, DNNs
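The unbalanced training condition can be illustrated with a small, hypothetical corpus listing in which most speakers contribute only neutral utterances and a few also contribute emotional ones; the speaker IDs and emotion labels below are invented for illustration.

```python
from collections import defaultdict

# Hypothetical corpus: (speaker, emotion) tag per utterance.
corpus = [
    ("spk01", "neutral"), ("spk02", "neutral"), ("spk03", "neutral"),
    ("spk04", "neutral"), ("spk05", "neutral"),
    ("spk06", "neutral"), ("spk06", "joy"), ("spk06", "sadness"),
    ("spk07", "neutral"), ("spk07", "joy"), ("spk07", "sadness"),
]

emotions_per_speaker = defaultdict(set)
for spk, emo in corpus:
    emotions_per_speaker[spk].add(emo)

emotional_speakers = [s for s, emos in emotions_per_speaker.items() if emos != {"neutral"}]
neutral_only_speakers = [s for s in emotions_per_speaker if s not in emotional_speakers]

# The speaker factor can be learned from every speaker's neutral speech, while
# the emotional factor must be learned from the few emotional speakers only.
print(len(neutral_only_speakers), "neutral-only speakers,",
      len(emotional_speakers), "emotional speakers")
```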

Evaluation experiments

To evaluate the extrapolation performance of the proposed architectures, open-emotion and closed-emotion tests are conducted both objectively and subjectively. Here, open-emotion and closed-emotion mean that the emotional speech of the target speaker is excluded from and included in the training data, respectively.
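A minimal sketch of the open-emotion condition, using the same kind of hypothetical corpus listing as above: the target speaker's emotional utterances are held out of training and used only for evaluation, whereas in the closed-emotion condition they would remain in the training set.

```python
# Hypothetical utterance list; "spk06" plays the role of the target speaker.
corpus = [
    ("spk01", "neutral"), ("spk02", "neutral"), ("spk03", "neutral"),
    ("spk06", "neutral"), ("spk06", "joy"), ("spk06", "sadness"),
]

def split_open_emotion(corpus, target_speaker):
    """Open-emotion condition: the target speaker's emotional utterances are
    excluded from training and kept only for evaluation."""
    train = [(s, e) for s, e in corpus if not (s == target_speaker and e != "neutral")]
    test = [(s, e) for s, e in corpus if s == target_speaker and e != "neutral"]
    return train, test

train_set, test_set = split_open_emotion(corpus, target_speaker="spk06")
print(len(train_set), "training utterances,",
      len(test_set), "held-out emotional utterances")
# In the closed-emotion condition, the held-out emotional utterances would
# instead remain in the training set.
```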

Conclusion and future work

In this paper, to generate emotional expressions using DNN-based TTS, we proposed the following five models: PM, SM, AIM, PM&AIM, and SM&AIM. These models are based on the following extrapolation approach: emotional expression models are trained using speech uttered by a particular person, and the models are applied to another person to generate emotional speech with that person’s voice quality. In other words, the collection of emotional speech uttered by a target speaker is unnecessary, and

CRediT authorship contribution statement

Katsuki Inoue: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Data curation, Visualization, Writing - original draft, Project administration. Sunao Hara: Resources, Writing - review & editing. Masanobu Abe: Conceptualization, Resources, Writing - review & editing, Supervision, Funding acquisition. Nobukatsu Hojo: Software, Resources, Writing - review & editing. Yusuke Ijima: Conceptualization, Methodology, Software, Resources, Writing - review & editing,

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (28)

  • Kawahara, Hideki, et al., 1999. Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of repetitive structure in sounds. Speech Commun.
  • An, Shumin, Ling, Zhenhua, Dai, Lirong, 2017. Emotional statistical parametric speech synthesis using LSTM-RNNs. In:...
  • Caruana, Rich, 1997. Multitask learning. Mach. Learn.
  • Dehak, Najim, et al., 2010. Front-end factor analysis for speaker verification. IEEE Trans. Audio Speech Language Process.
  • Fan, Yuchen, Qian, Yao, Soong, Frank K., He, Lei, 2015. Multi-speaker modeling and speaker adaptation for DNN-based TTS...
  • Fan, Yuchen, Qian, Yao, Xie, Feng-Long, Soong, Frank K., 2014. TTS synthesis with bidirectional LSTM based recurrent...
  • Hojo, Nobukatsu, et al., 2018. DNN-based speech synthesis using speaker codes. IEICE Trans. Inform. Syst.
  • Inoue, Katsuki, et al., 2017. An investigation to transplant emotional expressions in DNN-based TTS synthesis.
  • Jaime, Lorenzo-Trueba, et al., 2018. Investigating different representations for modeling and controlling multiple emotions in DNN-based speech synthesis. Speech Commun.
  • Jaime, Lorenzo-Trueba, Roberto, Barra-Chicote, Watts, Oliver, Montero, Juan Manuel, 2013. Towards speaking style...
  • Kanagawa, Hiroki, Nose, Takashi, Kobayashi, Takao, 2013. Speaker-independent style conversion for HMM-based expressive...
  • Li, Bo, Zen, Heiga, 2016. Multi-language multi-speaker acoustic modeling for LSTM-RNN based statistical parametric...
  • Luong, Hieu-Thi, Takaki, Shinji, Henter, Gustav Eje, Yamagishi, Junichi, 2017. Adapting and controlling DNN-based...
  • Ohtani, Yamato, Nasu, Yu, Morita, Masahiro, Akamine, Masami, 2015. Emotional transplant in statistical speech synthesis...