Text-to-speech system for low-resource language using cross-lingual transfer learning and data augmentation
EURASIP Journal on Audio, Speech, and Music Processing (IF 1.7) Pub Date: 2021-12-04, DOI: 10.1186/s13636-021-00225-4
Zolzaya Byambadorj, Ryota Nishimura, Altangerel Ayush, Kengo Ohta, Norihide Kitaoka

Deep learning techniques are currently being applied in automated text-to-speech (TTS) systems, resulting in significant improvements in performance. However, these methods require large amounts of text-speech paired data for model training, and collecting this data is costly. Therefore, in this paper, we propose a single-speaker TTS system containing both a spectrogram prediction network and a neural vocoder for the target language, using only 30 min of target language text-speech paired data for training. We evaluate three approaches for training the spectrogram prediction models of our TTS system, which produce mel-spectrograms from the input phoneme sequence: (1) cross-lingual transfer learning, (2) data augmentation, and (3) a combination of the previous two methods. In the cross-lingual transfer learning method, we used two high-resource language datasets, English (24 h) and Japanese (10 h). We also used 30 min of target language data for training in all three approaches, and for generating the augmented data used for training in methods 2 and 3. We found that using both cross-lingual transfer learning and augmented data during training resulted in the most natural synthesized target speech output. We also compare single-speaker and multi-speaker training methods, using sequential and simultaneous training, respectively. The multi-speaker models were found to be more effective for constructing a single-speaker, low-resource TTS model. In addition, we trained two Parallel WaveGAN (PWG) neural vocoders, one using 13 h of our augmented data with 30 min of target language data and one using the entire 12 h of the original target language dataset. Our subjective AB preference test indicated that the neural vocoder trained with augmented data achieved almost the same perceived speech quality as the vocoder trained with the entire target language dataset. 
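The pretrain-then-fine-tune recipe behind approach (1) can be illustrated with a toy model. Everything below is a hypothetical sketch, not the paper's actual setup: a linear least-squares model stands in for the spectrogram prediction network, and the "high-resource" and "target" tasks are synthetic. The point it demonstrates is the same, though: starting fine-tuning from weights learned on a related, data-rich task beats training from scratch on a tiny target set.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, w_true, noise=0.05):
    # Synthetic regression task: y = x @ w_true + noise.
    x = rng.normal(size=(n, 3))
    y = x @ w_true + noise * rng.normal(size=n)
    return x, y

def train(x, y, w_init, lr=0.1, steps=200):
    # Plain gradient descent on mean squared error.
    w = w_init.copy()
    for _ in range(steps):
        grad = 2 * x.T @ (x @ w - y) / len(y)
        w -= lr * grad
    return w

def mse(x, y, w):
    return float(np.mean((x @ w - y) ** 2))

# A data-rich "source" task (standing in for English/Japanese) and a
# related but not identical "target" task with only 20 examples
# (standing in for the 30-min low-resource dataset).
w_source = np.array([1.0, -2.0, 0.5])
w_target = np.array([1.1, -1.9, 0.6])
x_src, y_src = make_data(2000, w_source)
x_tgt, y_tgt = make_data(20, w_target)
x_test, y_test = make_data(500, w_target)

# Transfer learning: pretrain on source, briefly fine-tune on target.
w_pre = train(x_src, y_src, np.zeros(3))
w_ft = train(x_tgt, y_tgt, w_pre, steps=5)

# Baseline: the same short training budget, but from scratch.
w_scratch = train(x_tgt, y_tgt, np.zeros(3), steps=5)

print("fine-tuned:", mse(x_test, y_test, w_ft))
print("scratch:   ", mse(x_test, y_test, w_scratch))
```

Because the pretrained weights already sit close to the target task's optimum, a few fine-tuning steps on the small set suffice, whereas the from-scratch model with the same budget lands far from it.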
Overall, we found that our proposed TTS system consisting of a spectrogram prediction network and a PWG neural vocoder was able to achieve reasonable performance using only 30 min of target language training data. We also found that by using 3 h of target language data, for training the model and for generating augmented data, our proposed TTS model was able to achieve performance very similar to that of the baseline model, which was trained with 12 h of target language data.
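For context, the mel-spectrograms that the spectrogram prediction network is trained to output can be computed from a waveform as sketched below. The frame size, hop length, and 80-band mel scale are common TTS defaults assumed here for illustration, not parameters taken from the paper.

```python
import numpy as np

SR, N_FFT, HOP, N_MELS = 22050, 1024, 256, 80

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters spaced evenly on the mel scale,
    # mapping FFT bins (0 .. n_fft//2) to n_mels bands.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mel_spectrogram(wav):
    # Frame, window, FFT, then project the power spectrum onto mel bands.
    n_frames = 1 + (len(wav) - N_FFT) // HOP
    window = np.hanning(N_FFT)
    frames = np.stack([wav[i * HOP:i * HOP + N_FFT] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = power @ mel_filterbank(SR, N_FFT, N_MELS).T
    return np.log(np.maximum(mel, 1e-10))  # log-mel, shape (frames, N_MELS)

# One second of a 440 Hz tone as a stand-in waveform.
t = np.arange(SR) / SR
spec = mel_spectrogram(0.5 * np.sin(2 * np.pi * 440 * t))
print(spec.shape)
```

In a full TTS pipeline such as the one described here, the prediction network learns to map phoneme sequences to frames like these, and the neural vocoder (Parallel WaveGAN in this paper) inverts them back to a waveform.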
