Hierarchical Multi-Grained Generative Model for Expressive Speech Synthesis
arXiv - CS - Sound. Pub Date: 2020-09-17, DOI: arxiv-2009.08474
Yukiya Hono, Kazuna Tsuboi, Kei Sawada, Kei Hashimoto, Keiichiro Oura, Yoshihiko Nankaku, Keiichi Tokuda

This paper proposes a hierarchical generative model with a multi-grained latent variable to synthesize expressive speech. In recent years, fine-grained latent variables have been introduced into text-to-speech synthesis to enable fine control of the prosody and speaking style of synthesized speech. However, the naturalness of the speech degrades when these latent variables are obtained by sampling from a standard Gaussian prior. To solve this problem, we propose a novel framework for modeling the fine-grained latent variables that considers their dependence on the input text, the hierarchical linguistic structure, and the temporal structure of the latent variables. The framework consists of a multi-grained variational autoencoder, a conditional prior, and a multi-level auto-regressive latent converter. Together, these components obtain latent variables at different time resolutions and sample the finer-level latent variables from the coarser-level ones while taking the input text into account. Experimental results indicate that the proposed framework is an appropriate method of sampling fine-grained latent variables without a reference signal at the synthesis stage. The proposed framework also provides control over the speaking style of an entire utterance.
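
The coarse-to-fine sampling idea in the abstract can be illustrated with a toy sketch. This is not the authors' model: the three level names (utterance, phrase, word), the scalar latents, the function names, and every numeric weight below are assumptions made for illustration, and the learned networks are replaced by fixed linear maps. It only shows the shape of the procedure: a text-conditioned prior replaces the standard Gaussian at the coarsest level, and each finer latent is sampled autoregressively from its coarser parent, its predecessor, and the text.

```python
import math
import random


def sample_gaussian(mean, log_var, rng):
    """Draw mean + sigma * eps with eps ~ N(0, 1)."""
    return mean + math.exp(0.5 * log_var) * rng.gauss(0.0, 1.0)


def conditional_prior(text_feature):
    """Toy stand-in for a learned text-conditioned prior.

    Returns (mean, log-variance). The weights are hypothetical.
    """
    return 0.5 * text_feature, -1.0


def sample_finer_level(coarse_z, text_features, rng):
    """Toy autoregressive converter (hypothetical linear map).

    Each fine latent depends on the coarse latent, the previously
    sampled fine latent, and the matching text feature.
    """
    fine, prev = [], 0.0
    for feat in text_features:
        mean = 0.6 * coarse_z + 0.3 * prev + 0.1 * feat
        prev = sample_gaussian(mean, -2.0, rng)
        fine.append(prev)
    return fine


def sample_hierarchy(utt_feature, phrase_features, word_features, rng):
    """Sample utterance -> phrase -> word latents from text features only."""
    z_utt = sample_gaussian(*conditional_prior(utt_feature), rng)
    z_phrases = sample_finer_level(z_utt, phrase_features, rng)
    z_words = [
        sample_finer_level(z_p, feats, rng)
        for z_p, feats in zip(z_phrases, word_features)
    ]
    return z_utt, z_phrases, z_words


# Usage: two phrases containing two and three words, with made-up features.
rng = random.Random(0)
z_utt, z_phrases, z_words = sample_hierarchy(
    0.2, [0.1, -0.3], [[0.5, 0.4], [0.0, 0.2, -0.1]], rng
)
print(len(z_phrases), [len(w) for w in z_words])
```

In this sketch, fixing `z_utt` by hand instead of sampling it would shift every phrase-level and word-level latent in the same direction, which is one way to picture the utterance-level style control the abstract mentions.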

Updated: 2020-09-21