Low-resource expressive text-to-speech using data augmentation
arXiv - CS - Sound. Pub Date: 2020-11-11. DOI: arxiv-2011.05707
Goeric Huybrechts, Thomas Merritt, Giulia Comini, Bartek Perz, Raahil Shah, Jaime Lorenzo-Trueba

While recent neural text-to-speech (TTS) systems perform remarkably well, they typically require a substantial amount of recordings from the target speaker reading in the desired speaking style. In this work, we present a novel 3-step methodology to circumvent the costly operation of recording large amounts of target data, building expressive-style voices with as little as 15 minutes of such recordings. First, we augment the data via voice conversion, leveraging recordings in the desired speaking style from other speakers. Next, we use that synthetic data on top of the available recordings to train a TTS model. Finally, we fine-tune that model to further increase quality. Our evaluations show that the proposed changes bring significant improvements over non-augmented models across many perceived aspects of synthesised speech. We demonstrate the proposed approach on two styles (newscaster and conversational), across various speakers, and with both single-speaker and multi-speaker models, illustrating the robustness of our approach.
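
The three steps above map onto a simple training pipeline. The sketch below is a minimal illustration of that flow under stated assumptions, not the authors' implementation: every name in it (Utterance, VoiceConversion, TTSModel, build_expressive_voice) is a hypothetical placeholder, and the placeholder classes only log what a real system would train.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    """One (text, audio) pair with speaker and style metadata."""
    text: str
    audio_path: str
    speaker: str
    style: str


class VoiceConversion:
    """Placeholder for a voice-conversion model that re-renders a donor
    speaker's expressive recording in the target speaker's voice."""

    def __init__(self, target_speaker: str):
        self.target_speaker = target_speaker

    def convert(self, utt: Utterance) -> Utterance:
        # A real system would synthesise new audio here; this stub
        # only relabels the utterance with the target speaker.
        return Utterance(utt.text, utt.audio_path + ".converted.wav",
                         self.target_speaker, utt.style)


class TTSModel:
    """Placeholder for a neural TTS model."""

    def train(self, data: List[Utterance], fine_tune: bool = False) -> None:
        mode = "fine-tuning" if fine_tune else "training"
        print(f"{mode} on {len(data)} utterances")


def build_expressive_voice(target_recordings: List[Utterance],
                           donor_recordings: List[Utterance]) -> TTSModel:
    # Step 1: augment the ~15 minutes of target-speaker data via voice
    # conversion, leveraging expressive recordings from other speakers.
    vc = VoiceConversion(target_speaker=target_recordings[0].speaker)
    synthetic = [vc.convert(u) for u in donor_recordings]

    # Step 2: train the TTS model on real + synthetic data combined.
    tts = TTSModel()
    tts.train(target_recordings + synthetic)

    # Step 3: fine-tune on the real recordings alone to further
    # increase quality.
    tts.train(target_recordings, fine_tune=True)
    return tts
```

The key design point the sketch captures is that the synthetic data is used only to bootstrap training in step 2; the final fine-tuning pass in step 3 relies solely on the genuine target recordings.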

Updated: 2021-01-12