Towards Automatic Face-to-Face Translation,arXiv - CS - Multimedia

Towards Automatic Face-to-Face Translation
arXiv - CS - Multimedia. Pub Date: 2020-03-01, DOI: arxiv-2003.00418
Prajwal K R, Rudrabha Mukhopadhyay, Jerin Philip, Abhishek Jha, Vinay Namboodiri, C.V. Jawahar

In light of the recent breakthroughs in automatic machine translation systems, we propose a novel approach that we term "Face-to-Face Translation". As today's digital communication becomes increasingly visual, we argue that there is a need for systems that can automatically translate a video of a person speaking in language A into a target language B with realistic lip synchronization. In this work, we create an automatic pipeline for this problem and demonstrate its impact on multiple real-world applications. First, we build a working speech-to-speech translation system by bringing together multiple existing modules from speech and language. We then move towards "Face-to-Face Translation" by incorporating a novel visual module, LipGAN, for generating realistic talking faces from the translated audio. Quantitative evaluation of LipGAN on the standard LRW test set shows that it significantly outperforms existing approaches across all standard metrics. We also subject our Face-to-Face Translation pipeline to multiple human evaluations and show that it can significantly improve the overall user experience for consuming and interacting with multimodal content across languages. Code, models and demo video are made publicly available. Demo video: https://www.youtube.com/watch?v=aHG6Oei8jF0. Code and models: https://github.com/Rudrabha/LipGAN
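
The two-step structure described in the abstract (a speech-to-speech translation system followed by a lip-synchronization module) can be pictured as a chain of stages. The sketch below is only an illustration under stated assumptions: the abstract does not name the individual speech and language modules, so the split into recognition, translation, and synthesis is the conventional speech-to-speech decomposition assumed here, and every name (`FaceToFacePipeline`, `recognize`, `translate`, `synthesize`, `lip_sync`) is hypothetical rather than taken from the LipGAN code.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FaceToFacePipeline:
    """Hypothetical composition of the stages; not the authors' API."""

    # Assumed speech-to-speech stages (not named in the abstract).
    recognize: Callable[[bytes], str]    # speech in language A -> text in A
    translate: Callable[[str], str]      # text in A -> text in B
    synthesize: Callable[[str], bytes]   # text in B -> speech in B
    # Visual stage: stands in for a module such as LipGAN.
    lip_sync: Callable[[bytes, bytes], bytes]  # (video, speech in B) -> video

    def run(self, video: bytes, audio: bytes) -> bytes:
        # Step 1: speech-to-speech translation of the source audio.
        text_a = self.recognize(audio)
        text_b = self.translate(text_a)
        audio_b = self.synthesize(text_b)
        # Step 2: regenerate the talking face to match the translated audio.
        return self.lip_sync(video, audio_b)
```

Keeping each stage as an injected callable mirrors the abstract's point that the speech and language modules are existing components brought together, so any one of them could be swapped without touching the rest.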

Updated: 2020-03-03
