Translatotron 2: Robust direct speech-to-speech translation
arXiv - CS - Sound Pub Date : 2021-07-19 , DOI: arxiv-2107.08661
Ye Jia, Michelle Tadmor Ramanovich, Tal Remez, Roi Pomerantz

We present Translatotron 2, a neural direct speech-to-speech translation model that can be trained end-to-end. Translatotron 2 consists of a speech encoder, a phoneme decoder, a mel-spectrogram synthesizer, and an attention module that connects the three preceding components. Experimental results suggest that Translatotron 2 outperforms the original Translatotron by a large margin in translation quality and naturalness of the predicted speech, and drastically improves the robustness of the predicted speech by mitigating over-generation, such as babbling or long pauses. We also propose a new method for retaining the source speaker's voice in the translated speech. The trained model is restricted to retaining the source speaker's voice and, unlike the original Translatotron, cannot generate speech in a different speaker's voice, which makes the model more robust for production deployment by mitigating potential misuse in creating spoofed audio artifacts. When the new method is combined with a simple concatenation-based data augmentation, the trained Translatotron 2 model can retain each speaker's voice for input containing speaker turns.
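The data flow described above — encoder hidden states feeding an attention module that bridges a phoneme decoder and a mel-spectrogram synthesizer — can be sketched structurally as below. This is a minimal illustration of the component wiring only; every class body here is placeholder arithmetic (hypothetical, not the paper's actual layers or training objective).

```python
# Structural sketch of the Translatotron 2 pipeline from the abstract:
# speech encoder -> attention -> phoneme decoder -> mel-spectrogram synthesizer.
# All internals are placeholders; only the composition mirrors the description.

class SpeechEncoder:
    def __call__(self, frames):
        # Encode input speech frames into hidden states (placeholder: identity).
        return [list(frame) for frame in frames]

class Attention:
    def __call__(self, hidden):
        # Summarize encoder states into one context vector
        # (placeholder: mean over time).
        dim = len(hidden[0])
        return [sum(h[d] for h in hidden) / len(hidden) for d in range(dim)]

class PhonemeDecoder:
    def __call__(self, context, max_len=3):
        # Emit target-language phoneme ids (placeholder: a fixed sequence).
        return [1, 2, 3][:max_len]

class MelSynthesizer:
    def __call__(self, phonemes, context, n_mels=4):
        # Produce one mel frame per phoneme, conditioned on the shared
        # attention context (placeholder: context repeated).
        return [context[:n_mels] for _ in phonemes]

def translatotron2_forward(frames):
    encoder, attention = SpeechEncoder(), Attention()
    decoder, synthesizer = PhonemeDecoder(), MelSynthesizer()
    hidden = encoder(frames)
    context = attention(hidden)            # attention bridges all three parts
    phonemes = decoder(context)            # intermediate phoneme prediction
    mels = synthesizer(phonemes, context)  # spectrogram uses both outputs
    return phonemes, mels

phonemes, mels = translatotron2_forward([[0.1, 0.2, 0.3, 0.4]] * 5)
```

The phoneme decoder acting as an intermediate target, with the synthesizer conditioned on both the phonemes and the shared attention context, is what the abstract credits for reducing over-generation relative to the original Translatotron's direct spectrogram prediction.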

Updated: 2021-07-20