AVLnet: Learning Audio-Visual Language Representations from Instructional Videos
arXiv - CS - Multimedia. Pub Date: 2020-06-16. arXiv: 2006.09199
Andrew Rouditchenko, Angie Boggust, David Harwath, Dhiraj Joshi, Samuel Thomas, Kartik Audhkhasi, Rogerio Feris, Brian Kingsbury, Michael Picheny, Antonio Torralba, James Glass
Current methods for learning visually grounded language from videos often
rely on time-consuming and expensive data collection, such as human annotated
textual summaries or machine generated automatic speech recognition
transcripts. In this work, we introduce Audio-Video Language Network (AVLnet),
a self-supervised network that learns a shared audio-visual embedding space
directly from raw video inputs. We circumvent the need for annotation and
instead learn audio-visual language representations directly from randomly
segmented video clips and their raw audio waveforms. We train AVLnet on
publicly available instructional videos and evaluate our model on video clip
and language retrieval tasks on three video datasets. Our proposed model
outperforms several state-of-the-art text-video baselines by up to 11.8% in a
video clip retrieval task, despite operating on the raw audio instead of
manually annotated text captions. Further, we show AVLnet is capable of
integrating textual information, increasing its modularity and improving
performance by up to 20.3% on the video clip retrieval task. Finally, we
perform analysis of AVLnet's learned representations, showing our model has
learned to relate visual objects with salient words and natural sounds.
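The core idea described above — encoding a video clip and its raw audio into a shared embedding space so that matching pairs score higher than mismatched ones — can be sketched with a toy contrastive setup. The sketch below uses random pooled features, toy linear projections in place of AVLnet's deep audio and video encoders, and a simplified InfoNCE-style margin loss as a stand-in for the paper's actual training objective; all names, dimensions, and the margin value are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Hypothetical pooled features for a batch of 4 clips:
# 128-d audio descriptors and 256-d visual descriptors.
audio_feats = rng.standard_normal((4, 128))
video_feats = rng.standard_normal((4, 256))

# Toy linear projections into a shared 64-d embedding space
# (stand-ins for AVLnet's learned audio/video encoders).
W_a = rng.standard_normal((128, 64)) * 0.1
W_v = rng.standard_normal((256, 64)) * 0.1

a = l2_normalize(audio_feats @ W_a)   # audio embeddings, unit length
v = l2_normalize(video_feats @ W_v)   # video embeddings, unit length

sim = a @ v.T                         # pairwise cosine similarities

def contrastive_loss(sim, margin=0.001):
    # Each clip's own audio/video pair (the diagonal) should score
    # higher than mismatched pairs, in both retrieval directions.
    pos = np.diag(sim) - margin
    loss_av = np.mean(np.log(np.exp(sim).sum(axis=1)) - pos)
    loss_va = np.mean(np.log(np.exp(sim).sum(axis=0)) - pos)
    return loss_av + loss_va

loss = contrastive_loss(sim)

# Retrieval at evaluation time: rank video clips for each audio query
# by similarity, as in the video clip retrieval task.
ranking = np.argsort(-sim, axis=1)
```

Minimizing such a loss pulls each clip's audio and visual embeddings together while pushing apart mismatched pairs within the batch, which is what makes nearest-neighbor retrieval in the shared space meaningful at test time.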
Updated: 2020-06-17