A unit selection text-to-speech-and-singing synthesis framework from neutral speech: proof of concept
EURASIP Journal on Audio, Speech, and Music Processing (IF 1.7). Pub Date: 2019-12-01. DOI: 10.1186/s13636-019-0163-y
Marc Freixes, Francesc Alías, Joan Claudi Socoró

Text-to-speech (TTS) synthesis systems have been widely used in general-purpose applications based on the generation of speech. Nonetheless, some domains, such as storytelling or voice output aid devices, may also require singing. To enable a corpus-based TTS system to sing, a supplementary singing database should be recorded. This solution, however, might be too costly for occasional singing needs, or even unfeasible if the original speaker is unavailable or unable to sing properly. This work introduces a unit selection-based text-to-speech-and-singing (US-TTS&S) synthesis framework, which integrates speech-to-singing (STS) conversion to enable the generation of both speech and singing from an input text and a score, respectively, using the same neutral speech corpus. The viability of the proposal is evaluated on a proof-of-concept implementation using a 2.6-h Spanish neutral speech corpus, considering three vocal ranges and two tempos. The experiments show that challenging STS transformation factors are required to sing beyond the corpus vocal range and/or with notes longer than 150 ms. While score-driven US configurations allow pitch-scale factors to be reduced, time-scale factors are not reduced because of the short duration of the spoken vowels. Moreover, in the MUSHRA test, text-driven and score-driven US configurations obtain similar naturalness scores of around 40 for all the analysed scenarios. Although these naturalness scores are far from those of Vocaloid, the singing scores of around 60 validate that the framework could reasonably address occasional singing needs.
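To make the role of the STS transformation factors concrete, the following Python sketch estimates the pitch-scale and time-scale factors needed to map a selected spoken vowel unit onto a target note of the score. It is an illustration only, not the authors' implementation; the unit and note attributes (f0 in Hz, durations in ms, MIDI pitch) are assumed names chosen for this example.

```python
# Minimal sketch (assumed interfaces, not the paper's code) of the transformation
# factors a speech-to-singing module must apply to a spoken vowel unit so that it
# matches a target note from the score.

def midi_to_hz(midi_pitch: int) -> float:
    """Convert a MIDI note number to its fundamental frequency in Hz (A4 = 69 = 440 Hz)."""
    return 440.0 * 2.0 ** ((midi_pitch - 69) / 12.0)

def sts_factors(unit_f0_hz: float, unit_dur_ms: float,
                note_midi: int, note_dur_ms: float) -> tuple[float, float]:
    """Return (pitch_scale, time_scale) required to turn the spoken vowel into the note."""
    pitch_scale = midi_to_hz(note_midi) / unit_f0_hz   # >1: the unit's pitch must be raised
    time_scale = note_dur_ms / unit_dur_ms             # >1: the unit must be stretched
    return pitch_scale, time_scale

# Example: a 90 ms spoken vowel at 120 Hz mapped onto a 600 ms A3 (220 Hz) note
# needs roughly a 1.8x pitch-scale and a 6.7x time-scale factor, illustrating why
# notes longer than ~150 ms demand challenging stretches of short spoken vowels.
if __name__ == "__main__":
    ps, ts = sts_factors(unit_f0_hz=120.0, unit_dur_ms=90.0,
                         note_midi=57, note_dur_ms=600.0)
    print(f"pitch-scale: {ps:.2f}, time-scale: {ts:.2f}")
```

Under this view, a score-driven unit selection can favour units whose pitch is already close to the target note (lowering the pitch-scale factor), but it cannot shrink the time-scale factor much, since spoken vowels in a neutral corpus are inherently short.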
