Translatotron 2: Robust direct speech-to-speech translation
arXiv - CS - Sound. Pub Date: 2021-07-19. DOI: arxiv-2107.08661. Ye Jia, Michelle Tadmor Ramanovich, Tal Remez, Roi Pomerantz
We present Translatotron 2, a neural direct speech-to-speech translation
model that can be trained end-to-end. Translatotron 2 consists of a speech
encoder, a phoneme decoder, a mel-spectrogram synthesizer, and an attention
module that connects all the previous three components. Experimental results
suggest that Translatotron 2 outperforms the original Translatotron by a large
margin in terms of translation quality and predicted speech naturalness, and
drastically improves the robustness of the predicted speech by mitigating
over-generation, such as babbling or long pauses. We also propose a new method
for retaining the source speaker's voice in the translated speech. The trained
model is restricted to retain the source speaker's voice, and unlike the
original Translatotron, it is not able to generate speech in a different
speaker's voice, making the model more robust for production deployment by
mitigating potential misuse for creating spoofing audio artifacts. When the new
method is used together with a simple concatenation-based data augmentation,
the trained Translatotron 2 model is able to retain each speaker's voice for
inputs with speaker turns.
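The concatenation-based augmentation mentioned above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the dict-of-frame-lists data layout, the `silence_frames` gap between turns, and the function name are all assumptions for the sake of the example.

```python
def concat_augment(examples, silence_frames=10, n_mels=80):
    """Join single-speaker training examples into one example with
    speaker turns (a sketch of concatenation-based augmentation).

    Each example is a dict with "source_mel" and "target_mel", here
    represented as lists of mel frames (each frame a list of n_mels
    floats); a real implementation would operate on spectrogram tensors.
    """
    # Hypothetical choice: a short block of silent frames between turns.
    gap = [[0.0] * n_mels for _ in range(silence_frames)]
    src, tgt = [], []
    for i, ex in enumerate(examples):
        if i > 0:  # insert the silence gap at each speaker turn
            src.extend(gap)
            tgt.extend(gap)
        src.extend(ex["source_mel"])
        tgt.extend(ex["target_mel"])
    return {"source_mel": src, "target_mel": tgt}


# Usage: two single-speaker utterances become one two-turn training pair.
ex1 = {"source_mel": [[0.0] * 80] * 5, "target_mel": [[0.0] * 80] * 6}
ex2 = {"source_mel": [[0.0] * 80] * 7, "target_mel": [[0.0] * 80] * 4}
turn_pair = concat_augment([ex1, ex2])
```

Training on such concatenated pairs is what lets the voice-retention method generalize to inputs containing more than one speaker.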
Updated: 2021-07-20