AVLnet: Learning Audio-Visual Language Representations from Instructional Videos
arXiv - CS - Multimedia. Pub Date: 2020-06-16. arXiv: 2006.09199
Andrew Rouditchenko, Angie Boggust, David Harwath, Dhiraj Joshi, Samuel Thomas, Kartik Audhkhasi, Rogerio Feris, Brian Kingsbury, Michael Picheny, Antonio Torralba, James Glass
Current methods for learning visually grounded language from videos often
rely on time-consuming and expensive data collection, such as human annotated
textual summaries or machine generated automatic speech recognition
transcripts. In this work, we introduce Audio-Video Language Network (AVLnet),
a self-supervised network that learns a shared audio-visual embedding space
directly from raw video inputs. We circumvent the need for annotation and
instead learn audio-visual language representations directly from randomly
segmented video clips and their raw audio waveforms. We train AVLnet on
publicly available instructional videos and evaluate our model on video clip
and language retrieval tasks on three video datasets. Our proposed model
outperforms several state-of-the-art text-video baselines by up to 11.8% in a
video clip retrieval task, despite operating on the raw audio instead of
manually annotated text captions. Further, we show AVLnet is capable of
integrating textual information, increasing its modularity and improving
performance by up to 20.3% on the video clip retrieval task. Finally, we
perform analysis of AVLnet's learned representations, showing our model has
learned to relate visual objects with salient words and natural sounds.
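The core idea described above — encoding a video clip and its raw audio into a shared embedding space so that matching pairs score higher than mismatched ones — can be sketched with a toy contrastive setup. The sketch below uses random pooled features, toy linear projections in place of AVLnet's deep audio and video encoders, and a simplified InfoNCE-style margin loss as a stand-in for the paper's actual training objective; all names, dimensions, and the margin value are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Hypothetical pooled features for a batch of 4 clips:
# 128-d audio descriptors and 256-d visual descriptors.
audio_feats = rng.standard_normal((4, 128))
video_feats = rng.standard_normal((4, 256))

# Toy linear projections into a shared 64-d embedding space
# (stand-ins for AVLnet's learned audio/video encoders).
W_a = rng.standard_normal((128, 64)) * 0.1
W_v = rng.standard_normal((256, 64)) * 0.1

a = l2_normalize(audio_feats @ W_a)   # audio embeddings, unit length
v = l2_normalize(video_feats @ W_v)   # video embeddings, unit length

sim = a @ v.T                         # pairwise cosine similarities

def contrastive_loss(sim, margin=0.001):
    # Each clip's own audio/video pair (the diagonal) should score
    # higher than mismatched pairs, in both retrieval directions.
    pos = np.diag(sim) - margin
    loss_av = np.mean(np.log(np.exp(sim).sum(axis=1)) - pos)
    loss_va = np.mean(np.log(np.exp(sim).sum(axis=0)) - pos)
    return loss_av + loss_va

loss = contrastive_loss(sim)

# Retrieval at evaluation time: rank video clips for each audio query
# by similarity, as in the video clip retrieval task.
ranking = np.argsort(-sim, axis=1)
```

Minimizing such a loss pulls each clip's audio and visual embeddings together while pushing apart mismatched pairs within the batch, which is what makes nearest-neighbor retrieval in the shared space meaningful at test time.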
Updated: 2020-06-17