MASS: Multi-task anthropomorphic speech synthesis framework, Computer Speech & Language



MASS: Multi-task anthropomorphic speech synthesis framework
Computer Speech & Language (IF 3.1), Pub Date: 2021-05-21, DOI: 10.1016/j.csl.2021.101243
Jinyin Chen, Linhui Ye, Zhaoyan Ming

Text-to-Speech (TTS) synthesis plays an important role in human-computer interaction. Currently, most TTS technologies focus on the naturalness of speech, that is, on making synthesized speech sound human. However, two key tasks, expressing emotion and conveying speaker identity, are ignored, which limits the application scenarios of TTS synthesis technology. To make synthesized speech more realistic and expand its application scenarios, we propose a multi-task anthropomorphic speech synthesis framework (MASS), which synthesizes speech from text with a specified emotion and speaker identity. The MASS framework consists of a base TTS module and two novel voice conversion modules: an emotional voice conversion module and a speaker voice conversion module. For these modules we propose a deep emotion voice conversion model (DEVC) and a deep speaker voice conversion model (DSVC), both based on convolutional residual networks. This design addresses the problem of feature loss during voice conversion. Training of both models does not depend on parallel datasets, and both models are capable of many-to-many voice conversion. Results from the emotional voice conversion, speaker voice conversion, and multi-task speech synthesis experiments show that DEVC and DSVC convert speech effectively. Quantitative and qualitative evaluation of the multi-task speech synthesis experiments shows that MASS can effectively synthesize speech with the specified text, emotion, and speaker identity.
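The three-module structure described in the abstract can be pictured as a simple composition of stages. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the `Speech` type, the `synthesize` function, and the toy stand-in modules are all hypothetical names introduced for illustration, and the order in which the two conversion modules are applied (emotion first, then speaker) is an assumption, since the abstract does not specify it.

```python
from dataclasses import dataclass, replace
from typing import Callable, List


# Hypothetical stand-in for synthesized audio: raw samples plus the
# emotion and speaker labels the audio currently carries.
@dataclass(frozen=True)
class Speech:
    samples: List[float]
    emotion: str
    speaker: str


# Hypothetical interfaces for the three modules named in the abstract.
TTSModule = Callable[[str], Speech]               # text -> base speech
ConversionModule = Callable[[Speech, str], Speech]  # (speech, target) -> speech


def synthesize(
    text: str,
    emotion: str,
    speaker: str,
    tts: TTSModule,
    convert_emotion: ConversionModule,
    convert_speaker: ConversionModule,
) -> Speech:
    """Chain a base TTS module with two voice conversion modules.

    Assumption: emotion conversion runs before speaker conversion.
    """
    base = tts(text)
    emotional = convert_emotion(base, emotion)
    return convert_speaker(emotional, speaker)


# Toy modules so the sketch runs end to end. Real modules would be
# neural models (DEVC and DSVC in the paper); these only relabel.
def toy_tts(text: str) -> Speech:
    return Speech([float(ord(c)) for c in text], "neutral", "base")


def toy_emotion_conversion(speech: Speech, target: str) -> Speech:
    return replace(speech, emotion=target)


def toy_speaker_conversion(speech: Speech, target: str) -> Speech:
    return replace(speech, speaker=target)


result = synthesize(
    "hi", "happy", "speaker_1",
    toy_tts, toy_emotion_conversion, toy_speaker_conversion,
)
```

Passing the modules in as callables keeps the stages independent, which mirrors the abstract's point that the conversion models are trained separately from the base TTS module.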


Updated: 2021-06-04
