DualLip: A System for Joint Lip Reading and Generation, arXiv - CS - Multimedia


arXiv - CS - Multimedia, Pub Date: 2020-09-12, DOI: arxiv-2009.05784
Weicong Chen, Xu Tan, Yingce Xia, Tao Qin, Yu Wang, Tie-Yan Liu

Lip reading aims to recognize text from video of a talking lip, while lip generation aims to synthesize talking-lip video from text. Lip generation is a key component of talking face generation and is the dual task of lip reading. In this paper, we develop DualLip, a system that jointly improves lip reading and lip generation by leveraging this task duality together with unlabeled text and lip video data. The key ideas of DualLip are: 1) generate lip video from unlabeled text with a lip generation model, and use the resulting pseudo pairs to improve lip reading; 2) generate text from unlabeled lip video with a lip reading model, and use the resulting pseudo pairs to improve lip generation. We further extend DualLip to talking face generation by introducing two additional components: lip-to-face generation and text-to-speech generation. Experiments on GRID and TCD-TIMIT demonstrate that DualLip improves lip reading, lip generation, and talking face generation by utilizing unlabeled data. Specifically, the lip generation model in our DualLip system trained with only 10% of the paired data surpasses the performance of the model trained with all of the paired data. On the GRID lip reading benchmark, we achieve a 1.16% character error rate and a 2.71% word error rate, outperforming state-of-the-art models that use the same amount of paired data.
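
The two key ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function `build_pseudo_pairs`, the `Video` type, the toy models, and the sample sentences are all hypothetical stand-ins chosen only to show which side of each pseudo pair is real data and which side is model output.

```python
from typing import Callable, List, Tuple

Text = str
Video = List[List[float]]  # hypothetical stand-in: one feature vector per frame


def build_pseudo_pairs(
    unlabeled_texts: List[Text],
    unlabeled_videos: List[Video],
    lip_generator: Callable[[Text], Video],
    lip_reader: Callable[[Video], Text],
) -> Tuple[List[Tuple[Video, Text]], List[Tuple[Text, Video]]]:
    """Return (pairs for training lip reading, pairs for training lip generation)."""
    # Idea 1: real text -> synthetic video. The text side is real, so the
    # pair serves as extra (video, text) training data for lip reading.
    reading_pairs = [(lip_generator(t), t) for t in unlabeled_texts]
    # Idea 2: real video -> predicted text. The video side is real, so the
    # pair serves as extra (text, video) training data for lip generation.
    generation_pairs = [(lip_reader(v), v) for v in unlabeled_videos]
    return reading_pairs, generation_pairs


# Toy stand-ins so the sketch runs; real models would be trained networks.
def toy_generator(text: Text) -> Video:
    # Emit one single-value "frame" per character.
    return [[float(ord(ch))] for ch in text]


def toy_reader(video: Video) -> Text:
    # Invert the toy generator: one character per frame.
    return "".join(chr(int(frame[0])) for frame in video)


reading_pairs, generation_pairs = build_pseudo_pairs(
    unlabeled_texts=["bin blue at f two now"],  # made-up sample sentence
    unlabeled_videos=[toy_generator("set red by a one soon")],  # made-up sample
    lip_generator=toy_generator,
    lip_reader=toy_reader,
)
```

In each pseudo pair the training target is the real, unlabeled sample and only the input is model-generated, which is one plausible reading of how the duality lets each model supply training data for the other.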


Updated: 2020-09-15
