Incremental Text-to-Speech Synthesis Using Pseudo Lookahead with Large Pretrained Language Model
arXiv - CS - Sound Pub Date: 2020-12-23, DOI: arxiv-2012.12612
Takaaki Saeki, Shinnosuke Takamichi, Hiroshi Saruwatari

Text-to-speech (TTS) synthesis, a technique for artificially generating human-like utterances from text, has evolved dramatically in recent years with advances in end-to-end deep neural network-based methods. Most of these methods perform sentence-level TTS, which can take time-series information across the whole sentence into account. However, incremental TTS, which performs synthesis in smaller linguistic units, is necessary to realize the low-latency synthesis required by simultaneous speech-to-speech translation systems. In general, incremental TTS is subject to a trade-off between latency and the quality of the output speech: it is challenging to produce high-quality speech with a low-latency setup that makes little use of the unobserved future sentence (hereafter, "lookahead"). This study proposes an incremental TTS method that uses a pseudo lookahead generated with a language model to take future contextual information into account without increasing latency. Our method can be regarded as imitating a human's incremental reading, and it uses pretrained GPT-2, which captures large-scale linguistic knowledge, for lookahead generation. Evaluation results show that our method 1) achieves higher speech quality than a method using only the observed information, without increasing latency, and 2) reduces latency while achieving speech quality equivalent to waiting for the future context to be observed.

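To make the pseudo-lookahead idea concrete, below is a minimal sketch of generating a lookahead continuation with an off-the-shelf pretrained GPT-2 via the HuggingFace transformers library. The abstract does not specify the decoding strategy, the lookahead length, or the TTS interface, so the `n_future_tokens` parameter and the `synthesize_unit` call are illustrative assumptions, not the paper's implementation.

```python
# A minimal sketch of pseudo-lookahead generation with GPT-2, assuming the
# HuggingFace "transformers" library. The lookahead length and the TTS
# interface (synthesize_unit) are hypothetical placeholders; the paper's
# actual incremental TTS architecture is not reproduced here.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def pseudo_lookahead(observed_text: str, n_future_tokens: int = 10) -> str:
    """Generate a plausible continuation of the observed text to stand in
    for the not-yet-observed future context."""
    input_ids = tokenizer.encode(observed_text, return_tensors="pt")
    output_ids = model.generate(
        input_ids,
        max_new_tokens=n_future_tokens,
        do_sample=False,                      # greedy decoding for a deterministic sketch
        pad_token_id=tokenizer.eos_token_id,  # silence the missing-pad-token warning
    )
    # Keep only the newly generated tokens (the pseudo lookahead).
    return tokenizer.decode(output_ids[0, input_ids.shape[1]:])

# Example: build the context for the current unit from observed text plus
# pseudo lookahead, without waiting for the real future input.
observed = "Text-to-speech synthesis has dramatically"
context = observed + pseudo_lookahead(observed)
# synthesize_unit(current_unit=observed, context=context)  # hypothetical TTS call
print(context)
```

In the setting the abstract describes, the synthesizer would condition the current linguistic unit on such a generated continuation rather than waiting for the real future input, which is how the method avoids adding observation latency.
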
Updated: 2020-12-24