Meta-TTS: Meta-Learning for Few-Shot Speaker Adaptive Text-to-Speech
IEEE/ACM Transactions on Audio, Speech, and Language Processing (IF 4.1), Pub Date: 2022-04-13, DOI: 10.1109/taslp.2022.3167258
Sung-feng Huang, Chyi-Jiunn Lin, Da-rong Liu, Yi-chen Chen, Hung-yi Lee

Personalizing a speech synthesis system is a highly desired application, in which the system generates speech in the user's voice from only a few enrolled recordings. Recent works take two main approaches to building such a system: speaker adaptation and speaker encoding. On the one hand, speaker adaptation methods fine-tune a trained multi-speaker text-to-speech (TTS) model with a few enrolled samples. However, they require at least thousands of fine-tuning steps for high-quality adaptation, making them hard to deploy on devices. On the other hand, speaker encoding methods encode enrollment utterances into a speaker embedding, and the trained TTS model can synthesize the user's speech conditioned on that embedding. Nevertheless, the speaker encoder suffers from a generalization gap between seen and unseen speakers. In this paper, we propose applying a meta-learning algorithm to the speaker adaptation method. More specifically, we use Model-Agnostic Meta-Learning (MAML) as the training algorithm of a multi-speaker TTS model, aiming to find a meta-initialization from which the model can quickly adapt to any few-shot speaker adaptation task. The meta-trained TTS model can therefore be adapted to unseen speakers efficiently. Our experiments compare the proposed method (Meta-TTS) with two baselines: a speaker adaptation baseline and a speaker encoding baseline. The evaluation results show that Meta-TTS can synthesize speech with high speaker similarity from a few enrollment samples using far fewer adaptation steps than the speaker adaptation baseline, and it outperforms the speaker encoding baseline under the same training scheme. Even when the baseline's speaker encoder is pre-trained with data from an extra 8,371 speakers, Meta-TTS still outperforms the baseline on the LibriTTS dataset and achieves comparable results on the VCTK dataset.
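To make the described training scheme concrete, below is a minimal sketch of a MAML-style meta-update for few-shot speaker adaptation, using the common first-order approximation for simplicity. It is not the paper's actual implementation: the names (`tts_model`, `speaker_tasks`, `tts_loss`) and the inner/outer hyperparameters are hypothetical placeholders, and `tts_loss` is assumed to run the model on a batch and return a scalar reconstruction loss.

```python
import copy
import torch

def maml_meta_step(tts_model, meta_optimizer, speaker_tasks, tts_loss,
                   inner_lr=1e-3, inner_steps=5):
    """One meta-update over a batch of few-shot speaker-adaptation tasks.

    Each task is a (support_batch, query_batch) pair of utterances from
    a single speaker; tts_loss(model, batch) is a hypothetical helper
    returning a scalar TTS reconstruction loss.
    """
    meta_optimizer.zero_grad()

    for support_batch, query_batch in speaker_tasks:
        # Inner loop: adapt a copy of the shared initialization to one
        # speaker's few enrolled (support) samples.
        adapted = copy.deepcopy(tts_model)
        inner_opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
        for _ in range(inner_steps):
            inner_opt.zero_grad()
            tts_loss(adapted, support_batch).backward()
            inner_opt.step()

        # Outer objective: how well the adapted model fits held-out
        # (query) utterances of the same speaker.
        tts_loss(adapted, query_batch).backward()

        # First-order MAML: accumulate the adapted copy's gradients
        # onto the shared meta-parameters.
        for meta_p, task_p in zip(tts_model.parameters(), adapted.parameters()):
            if task_p.grad is not None:
                if meta_p.grad is None:
                    meta_p.grad = task_p.grad.clone()
                else:
                    meta_p.grad += task_p.grad

    # Push the meta-initialization toward parameters that adapt quickly.
    meta_optimizer.step()
```

At test time, the same inner loop alone is run on an unseen speaker's few enrollment samples, which is why only a handful of adaptation steps are needed compared with fine-tuning from an ordinary multi-speaker checkpoint.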

Updated: 2024-08-26