Fast Griffin Lim based Waveform Generation Strategy for Text-to-Speech Synthesis

arXiv - CS - Multimedia, Pub Date: 2020-07-11, DOI: arxiv-2007.05764
Ankit Sharma, Puneet Kumar, Vikas Maddukuri, Nagasai Madamshettib, Kishore KG, Sahit Sai Sriram Kavurub, Balasubramanian Raman and Partha Pratim Roy

The performance of text-to-speech (TTS) systems depends heavily on spectrogram-to-waveform generation, also known as the speech reconstruction phase. The time this phase takes is known as synthesis delay. This paper proposes an approach to reduce speech synthesis delay, with the aim of improving TTS systems for real-time applications such as digital assistants, mobile phones, and embedded devices. The proposed approach uses the Fast Griffin Lim Algorithm (FGLA) instead of the Griffin Lim Algorithm (GLA) as the vocoder in the speech synthesis phase. GLA and FGLA are both iterative, but FGLA converges faster than GLA. The approach is tested on the LJSpeech, Blizzard, and Tatoeba datasets, and the results for FGLA are compared against GLA and a neural Generative Adversarial Network (GAN) based vocoder. Performance is evaluated in terms of synthesis delay and speech quality. A 36.58% reduction in speech synthesis delay is observed. Output speech quality also improves, as supported by higher mean opinion scores (MOS) and faster convergence with FGLA than with GLA.
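
The abstract does not spell out the update rules, so the following is only a minimal sketch of how GLA and FGLA are commonly formulated, not the authors' implementation. Both alternate a projection onto spectrograms with the target magnitude and a projection onto consistent spectrograms (STFT of the inverse STFT); FGLA adds a momentum term `alpha` to the iterate. The frame length, hop size, Hamming window, toy test signal, iteration count, and all function names here are illustrative assumptions, and the naive pure-Python DFT is chosen for self-containment rather than speed.

```python
import cmath
import math

N, HOP = 16, 4  # frame length and hop size (assumed toy values)
# Hamming window: never zero, so the least-squares inverse below is well defined.
WINDOW = [0.54 - 0.46 * math.cos(2 * math.pi * n / N) for n in range(N)]
TW = [cmath.exp(-2j * math.pi * k / N) for k in range(N)]  # DFT twiddle factors

def stft(x):
    """Windowed naive DFT of each frame; len(x) should equal N + HOP * (frames - 1)."""
    frames = []
    for start in range(0, len(x) - N + 1, HOP):
        seg = [WINDOW[n] * x[start + n] for n in range(N)]
        frames.append([sum(seg[n] * TW[(k * n) % N] for n in range(N)) for k in range(N)])
    return frames

def istft(frames):
    """Least-squares inverse STFT: windowed overlap-add divided by the summed squared window."""
    length = N + HOP * (len(frames) - 1)
    num, den = [0.0] * length, [0.0] * length
    for m, frame in enumerate(frames):
        for n in range(N):
            y = sum(frame[k] * TW[(k * n) % N].conjugate() for k in range(N)).real / N
            num[m * HOP + n] += WINDOW[n] * y
            den[m * HOP + n] += WINDOW[n] ** 2
    return [a / b for a, b in zip(num, den)]

def project_magnitude(c, mag):
    """Keep the phase of c, replace its magnitude with the target magnitude."""
    return [[m * (z / abs(z)) if abs(z) > 1e-12 else complex(m) for z, m in zip(cr, mr)]
            for cr, mr in zip(c, mag)]

def project_consistent(c):
    """Project onto the set of spectrograms that are the STFT of some signal."""
    return stft(istft(c))

def spectral_error(c, mag):
    """Distance between the magnitude of c and the target magnitude."""
    return math.sqrt(sum((abs(z) - m) ** 2 for cr, mr in zip(c, mag) for z, m in zip(cr, mr)))

def reconstruct(mag, n_iter=30, alpha=0.0):
    """alpha = 0 gives plain GLA; alpha > 0 (for example 0.99) gives the FGLA-style update."""
    c = [[complex(m) for m in row] for row in mag]  # zero-phase initial guess
    t_prev = project_consistent(project_magnitude(c, mag))
    c = t_prev
    errors = []
    for _ in range(n_iter):
        t = project_consistent(project_magnitude(c, mag))
        # Momentum step: extrapolate along the direction of the last change.
        c = [[tz + alpha * (tz - pz) for tz, pz in zip(tr, pr)] for tr, pr in zip(t, t_prev)]
        t_prev = t
        errors.append(spectral_error(t, mag))
    return istft(t), errors

# Toy signal (assumed): two sinusoids, length chosen so frames tile it exactly.
signal = [math.sin(2 * math.pi * 0.11 * n) + 0.5 * math.sin(2 * math.pi * 0.27 * n)
          for n in range(64)]
target = [[abs(z) for z in row] for row in stft(signal)]

_, gla_err = reconstruct(target, alpha=0.0)
_, fgla_err = reconstruct(target, alpha=0.99)
print("GLA  final spectral error:", gla_err[-1])
print("FGLA final spectral error:", fgla_err[-1])
```

Comparing the two error curves over iterations is one way to inspect convergence speed on a given input. A practical system would use an FFT-based STFT on mel or linear spectrograms predicted by the TTS model rather than this toy setup.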


Updated: 2020-07-14