SqueezeWave: Extremely Lightweight Vocoders for On-device Speech Synthesis
arXiv - CS - Sound | Pub Date: 2020-01-16 | DOI: arxiv-2001.05685
Bohan Zhai, Tianren Gao, Flora Xue, Daniel Rothchild, Bichen Wu, Joseph E. Gonzalez, Kurt Keutzer

Automatic speech synthesis is a challenging task that is becoming increasingly important as edge devices begin to interact with users through speech. Typical text-to-speech pipelines include a vocoder, which translates intermediate audio representations into an audio waveform. Most existing vocoders are difficult to parallelize since each generated sample is conditioned on previous samples. WaveGlow is a flow-based feed-forward alternative to these auto-regressive models (Prenger et al., 2019). However, while WaveGlow can be easily parallelized, the model is too expensive for real-time speech synthesis on the edge. This paper presents SqueezeWave, a family of lightweight vocoders based on WaveGlow that can generate audio of similar quality to WaveGlow with 61x - 214x fewer MACs. Code, trained models, and generated audio are publicly available at https://github.com/tianrengao/SqueezeWave.
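
To make the "fewer MACs" comparison concrete: a MAC is one multiply-accumulate operation, and a reduction factor is the ratio of two MAC counts. The sketch below counts MACs for a single dense 1D convolution layer and compares two configurations. It is only an illustration under stated assumptions: the function names and all layer sizes are hypothetical and are not taken from WaveGlow or SqueezeWave, whose architectures the abstract does not describe.

```python
def conv1d_macs(length_out: int, channels_in: int, channels_out: int, kernel_size: int) -> int:
    """MACs for one dense 1D convolution layer (bias ignored).

    Each output element (one per output channel and output time step)
    needs channels_in * kernel_size multiply-accumulates.
    """
    return length_out * channels_out * channels_in * kernel_size


def reduction_factor(baseline_macs: int, compact_macs: int) -> float:
    """How many times fewer MACs the compact layer needs than the baseline."""
    return baseline_macs / compact_macs


# Hypothetical layer sizes, chosen only to show how the ratio behaves.
baseline = conv1d_macs(length_out=2000, channels_in=256, channels_out=256, kernel_size=3)
compact = conv1d_macs(length_out=250, channels_in=128, channels_out=128, kernel_size=3)

print(f"baseline MACs: {baseline:,}")
print(f"compact MACs:  {compact:,}")
print(f"reduction:     {reduction_factor(baseline, compact):.0f}x")
```

In this toy case, an 8x shorter time axis and half as many input and output channels multiply to a 32x reduction, which shows how modest per-dimension savings can compound into large overall factors.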


