Improving multi-speaker TTS prosody variance with a residual encoder and normalizing flows
arXiv - CS - Computation and Language. Pub Date: 2021-06-10, DOI: arxiv-2106.05762
Iván Vallés-Pérez, Julian Roth, Grzegorz Beringer, Roberto Barra-Chicote, Jasha Droppo

Text-to-speech systems have recently achieved quality almost indistinguishable from human speech. However, the prosody of those systems is generally flatter than that of natural speech, producing samples with low expressiveness. Disentangling speaker identity and prosody is crucial in text-to-speech systems to improve naturalness and produce more variable syntheses. This paper proposes a new neural text-to-speech model that approaches the disentanglement problem by conditioning a Tacotron2-like architecture on flow-normalized speaker embeddings, and by substituting the reference encoder with a new learned latent distribution responsible for modeling the intra-sentence variability due to prosody. Removing the dependency on the reference encoder eliminates the speaker-leakage problem that typically occurs in this kind of system, producing more distinctive syntheses at inference time. The new model achieves significantly higher prosody variance than the baseline on a set of quantitative prosody features, as well as higher speaker distinctiveness, without decreasing speaker intelligibility. Finally, we observe that the normalized speaker embeddings enable much richer speaker interpolations, substantially improving the distinctiveness of the new interpolated speakers.

Updated: 2021-06-11