Txt2Vid: Ultra-Low Bitrate Compression of Talking-Head Videos via Text


arXiv - CS - Multimedia. Pub Date: 2021-06-26, DOI: arxiv-2106.14014
Pulkit Tandon, Shubham Chandak, Pat Pataranutaporn, Yimeng Liu, Anesu M. Mapuranga, Pattie Maes, Tsachy Weissman, Misha Sra

Video accounts for the majority of internet traffic today, driving a continuous technological arms race between generating higher-quality content, transmitting larger files, and supporting network infrastructure. Adding to this is the recent surge in the use of video conferencing tools fueled by the COVID-19 pandemic. Since videos take up substantial bandwidth (~100 Kbps to a few Mbps), improved video compression can have a substantial impact on network performance for live and pre-recorded content, providing broader access to multimedia content worldwide. In this work, we present a novel video compression pipeline, called Txt2Vid, which substantially reduces data transmission rates by compressing webcam videos ("talking-head videos") to a text transcript. The text is transmitted and decoded into a realistic reconstruction of the original video using recent advances in deep-learning-based voice cloning and lip syncing models. Our generative pipeline achieves a two to three orders of magnitude reduction in bitrate compared to standard audio-video codecs (encoders-decoders), while maintaining equivalent Quality-of-Experience based on a subjective evaluation by users (n=242) in an online study. The code for this work is available at https://github.com/tpulkit/txt2vid.git.
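
As a rough plausibility check on the "two to three orders of magnitude" figure, the following back-of-the-envelope sketch compares the bitrate of a plain-text transcript with the low end of the video range quoted in the abstract. The speaking rate and bytes-per-word values are assumptions chosen for illustration, not numbers from the paper, and the sketch ignores text compression and any side information a real pipeline might send.

```python
import math

# Assumed values for illustration only; none are taken from the paper.
WORDS_PER_MINUTE = 150        # assumed conversational speaking rate
BYTES_PER_WORD = 6            # assumed average, including the trailing space
VIDEO_BITRATE_BPS = 100_000   # low end of the ~100 Kbps range in the abstract


def text_bitrate_bps(words_per_minute: float, bytes_per_word: float) -> float:
    """Bitrate of an uncompressed transcript, in bits per second."""
    return words_per_minute * bytes_per_word * 8 / 60


def orders_of_magnitude(reference_bps: float, candidate_bps: float) -> float:
    """log10 of the ratio between a reference bitrate and a candidate bitrate."""
    return math.log10(reference_bps / candidate_bps)


text_bps = text_bitrate_bps(WORDS_PER_MINUTE, BYTES_PER_WORD)
print(f"text: {text_bps:.0f} bps")
print(f"reduction: {orders_of_magnitude(VIDEO_BITRATE_BPS, text_bps):.1f} orders of magnitude")
```

Under these assumptions the transcript needs about 120 bps, roughly 2.9 orders of magnitude below 100 Kbps, which is in line with the range the abstract reports.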


Updated: 2021-06-29
