DeepSinger: Singing Voice Synthesis with Data Mined From the Web
arXiv - CS - Sound. Pub Date: 2020-07-09, DOI: arXiv-2007.04590
Yi Ren, Xu Tan, Tao Qin, Jian Luan, Zhou Zhao, Tie-Yan Liu

In this paper, we develop DeepSinger, a multi-lingual multi-singer singing voice synthesis (SVS) system, which is built from scratch using singing training data mined from music websites. The pipeline of DeepSinger consists of several steps, including data crawling, singing and accompaniment separation, lyrics-to-singing alignment, data filtration, and singing modeling. Specifically, we design a lyrics-to-singing alignment model to automatically extract the duration of each phoneme in the lyrics, proceeding from the coarse-grained sentence level to the fine-grained phoneme level, and further design a multi-lingual multi-singer singing model based on a feed-forward Transformer to directly generate linear spectrograms from lyrics, synthesizing voices with Griffin-Lim. DeepSinger has several advantages over previous SVS systems: 1) to the best of our knowledge, it is the first SVS system that directly mines training data from music websites; 2) the lyrics-to-singing alignment model avoids any human effort for alignment labeling and greatly reduces labeling cost; 3) the singing model based on a feed-forward Transformer is simple and efficient, removing the complicated acoustic feature modeling of parametric synthesis and leveraging a reference encoder to capture a singer's timbre from noisy singing data; and 4) it can synthesize singing voices in multiple languages and for multiple singers. We evaluate DeepSinger on our mined singing dataset, which consists of about 92 hours of data from 89 singers in three languages (Chinese, Cantonese, and English). The results demonstrate that, with singing data mined purely from the Web, DeepSinger can synthesize high-quality singing voices in terms of both pitch accuracy and voice naturalness. (Audio samples are available at https://speechresearch.github.io/deepsinger/.)
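
The abstract states that the singing model predicts linear spectrograms and that waveforms are recovered with Griffin-Lim. The snippet below is a minimal sketch of that final vocoding step only, not DeepSinger's actual code: the function name spectrogram_to_wave, the STFT parameters, and the sample rate are illustrative assumptions rather than values from the paper.

    # Sketch: invert a predicted linear (magnitude) spectrogram to audio with Griffin-Lim.
    # Assumes a spectrogram of shape (1 + n_fft // 2, frames); parameters are illustrative.
    import numpy as np
    import librosa
    import soundfile as sf

    def spectrogram_to_wave(linear_mag: np.ndarray,
                            n_fft: int = 1024,
                            hop_length: int = 256,
                            win_length: int = 1024,
                            n_iter: int = 60) -> np.ndarray:
        """Reconstruct a time-domain waveform from a linear magnitude spectrogram
        via Griffin-Lim iterative phase estimation."""
        return librosa.griffinlim(linear_mag,
                                  n_iter=n_iter,
                                  hop_length=hop_length,
                                  win_length=win_length,
                                  n_fft=n_fft)

    # Usage: suppose `pred_spec` is the spectrogram predicted by the singing model.
    # wav = spectrogram_to_wave(pred_spec)
    # sf.write("sample.wav", wav, 22050)  # 22050 Hz is an assumed sample rate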

Updated: 2020-07-16