Interaction-Integrated Network for Natural Language Moment Localization
IEEE Transactions on Image Processing (IF 10.6), Pub Date: 2021-01-22, DOI: 10.1109/tip.2021.3052086
Ke Ning, Lingxi Xie, Jianzhuang Liu, Fei Wu, Qi Tian

Natural language moment localization aims at localizing video clips according to a natural language description. The key to this challenging task lies in modeling the relationship between verbal descriptions and visual contents. Existing approaches often sample a number of clips from the video and individually determine how each of them is related to the query sentence. However, this strategy can fail dramatically, in particular when the query sentence refers to visual elements that appear outside of, or even far from, the target clip. In this paper, we address this issue by designing an Interaction-Integrated Network (I²N), which contains a few Interaction-Integrated Cells (I²Cs). The idea stems from the observation that the query sentence not only describes the target video clip but also contains semantic cues about the structure of the entire video. Based on this, I²Cs go one step beyond modeling short-term temporal contexts by encoding long-term video content into every frame feature. By stacking a few I²Cs, the resulting network, I²N, enjoys improved inference ability, brought by both (I) multi-level correspondence between vision and language and (II) more accurate cross-modal alignment. When evaluated on the challenging video moment localization dataset DiDeMo, I²N outperforms the state-of-the-art approach by a clear margin of 1.98%. On two other challenging datasets, Charades-STA and TACoS, I²N also reports competitive performance.
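To make the core idea concrete, the sketch below is a minimal, hypothetical illustration (not the authors' released code) of what an Interaction-Integrated Cell could look like: every frame feature is enriched with long-term video context via self-attention over all frames, and aligned with the query sentence via cross-attention; all class and parameter names (InteractionIntegratedCell, d_model, n_heads) are illustrative assumptions.

```python
# Hypothetical sketch of an interaction-integrated cell; names and layer choices are assumptions.
import torch
import torch.nn as nn


class InteractionIntegratedCell(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        # Self-attention over all frames injects long-term video content into every frame feature.
        self.video_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Cross-attention lets every frame attend to all query words (cross-modal alignment).
        self.query_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, frames: torch.Tensor, words: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, d_model) per-frame features; words: (B, L, d_model) query word features.
        ctx, _ = self.video_attn(frames, frames, frames)    # long-term temporal context
        frames = self.norm1(frames + ctx)
        aligned, _ = self.query_attn(frames, words, words)  # vision-language interaction
        frames = self.norm2(frames + aligned)
        return self.norm3(frames + self.ffn(frames))


# Stacking a few such cells gives a toy analogue of the multi-level interaction in I²N.
if __name__ == "__main__":
    cells = nn.ModuleList([InteractionIntegratedCell() for _ in range(3)])
    frames = torch.randn(2, 64, 256)  # 64 sampled frames
    words = torch.randn(2, 12, 256)   # 12 query tokens
    for cell in cells:
        frames = cell(frames, words)
    print(frames.shape)  # torch.Size([2, 64, 256])
```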

Updated: 2021-02-05