VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
arXiv - CS - Multimedia. Pub Date: 2021-04-22, DOI: arxiv-2104.11178
Hassan Akbari, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, Boqing Gong

We present a framework for learning multimodal representations from unlabeled data using convolution-free Transformer architectures. Specifically, our Video-Audio-Text Transformer (VATT) takes raw signals as inputs and extracts multimodal representations that are rich enough to benefit a variety of downstream tasks. We train VATT end-to-end from scratch using multimodal contrastive losses and evaluate its performance on the downstream tasks of video action recognition, audio event classification, image classification, and text-to-video retrieval. Furthermore, we study a modality-agnostic single-backbone Transformer by sharing weights among the three modalities. We show that the convolution-free VATT outperforms state-of-the-art ConvNet-based architectures in the downstream tasks. In particular, VATT's vision Transformer achieves top-1 accuracy of 82.1% on Kinetics-400, 83.6% on Kinetics-600, and 41.1% on Moments in Time, new records while avoiding supervised pre-training. Transferring to image classification leads to 78.7% top-1 accuracy on ImageNet compared to 64.7% by training the same Transformer from scratch, showing the generalizability of our model despite the domain gap between videos and images. VATT's audio Transformer also sets a new record on waveform-based audio event recognition by achieving an mAP of 39.4% on AudioSet without any supervised pre-training.
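
The abstract describes training with multimodal contrastive losses over embeddings produced by the Transformer backbones. The following is a minimal sketch, not the authors' implementation, of a symmetric InfoNCE-style loss between two modalities; the embedding dimensions, temperature value, and the random tensors standing in for pooled Transformer outputs are all illustrative assumptions.

    # Minimal sketch of a pairwise contrastive (InfoNCE-style) loss between two
    # modalities, assuming per-clip embeddings already projected to a common space.
    import torch
    import torch.nn.functional as F

    def nce_loss(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
        """Matching (e.g. video, audio) pairs from the same clip are positives;
        all other pairs in the batch serve as negatives."""
        z_a = F.normalize(z_a, dim=-1)                 # [B, D] unit-norm embeddings
        z_b = F.normalize(z_b, dim=-1)                 # [B, D]
        logits = z_a @ z_b.t() / temperature           # [B, B] scaled cosine similarities
        targets = torch.arange(z_a.size(0), device=z_a.device)
        # Cross-entropy in both directions (a -> b and b -> a), then average.
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    # Hypothetical usage with random features standing in for backbone outputs:
    video_emb = torch.randn(8, 512)   # pooled video tokens (illustrative)
    audio_emb = torch.randn(8, 512)   # pooled waveform tokens (illustrative)
    loss = nce_loss(video_emb, audio_emb)

In the same spirit, a text-video term can be added alongside the video-audio term and the two losses summed; the paper's exact pairing scheme and loss weighting are not specified in the abstract.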

Updated: 2021-04-23