Temporal Moment Localization via Natural Language by Utilizing Video Question Answers as a Special Variant and Bypassing NLP for Corpora
IEEE Transactions on Circuits and Systems for Video Technology (IF 8.4), Pub Date: 2022-03-28, DOI: 10.1109/tcsvt.2022.3162650
Hafiza Sadia Nawaz, Zhensheng Shi, Yanhai Gan, Amanuel Hirpa, Junyu Dong, Haiyong Zheng

Temporal moment localization using natural language (TMLNL) is an emerging problem in computer vision: given a long, untrimmed video and a natural-language query, the goal is to localize the specific moment in the video that best matches the query. Previous research concentrated on the visual side of TMLNL (objects, backgrounds, and other visual attributes), while the textual side was handled almost entirely with off-the-shelf natural language processing (NLP) techniques. A long query requires sufficient context to localize a moment correctly within a long untrimmed video, so performance deteriorated when queries were not fully understood, especially longer ones. In this paper, we treat TMLNL as a special variant of video question answering (VQA) and give the visual and textual inputs equal weight through our proposed VQA joint visual-textual framework (JVTF). To address the importance of the query itself, we further propose a novel bidirectional context predictor network (BCPN) that handles complex, long input queries without relying on NLP, by improving coarse-grained into fine-grained distinct-granularity representations. BCPN recovers the missing context of long input queries through a query handler (QH) and thereby helps the JVTF find the most relevant moment. Whereas stacking more encoding layers in transformers, LSTMs, and other NLP models tends to cause repetition of words, the QH reduces the repetition of word positions. The output of BCPN is combined with the JVTF's guided attention to further improve the final result. Extensive experiments on three benchmark datasets show that the proposed BCPN outperforms state-of-the-art methods by 2.65%, 2.49%, and 2.06% at IoU = 0.3, 0.5, and 0.7, respectively.
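The reported gains are measured with the temporal IoU metric that is standard for moment localization: the overlap between a predicted time interval and the ground-truth interval, with a prediction counted as correct when its IoU reaches a threshold (0.3, 0.5, or 0.7 above). As a rough, hedged sketch of that evaluation protocol (illustrative only, not code released with the paper; the function names are our own), it might look like:

import numpy as np

def temporal_iou(pred, gt):
    """Temporal IoU between a predicted and a ground-truth moment.

    Each moment is a (start, end) pair in seconds. This mirrors the
    standard protocol behind the IoU = 0.3 / 0.5 / 0.7 thresholds
    quoted in the abstract; it is not code from the paper.
    """
    inter_start = max(pred[0], gt[0])
    inter_end = min(pred[1], gt[1])
    intersection = max(0.0, inter_end - inter_start)
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0

def recall_at_iou(preds, gts, thresholds=(0.3, 0.5, 0.7)):
    """Fraction of queries whose top-1 moment reaches each IoU threshold."""
    ious = np.array([temporal_iou(p, g) for p, g in zip(preds, gts)])
    return {t: float((ious >= t).mean()) for t in thresholds}

if __name__ == "__main__":
    preds = [(12.0, 31.5), (4.2, 9.8)]   # predicted (start, end) moments
    gts = [(10.0, 30.0), (5.0, 12.0)]    # ground-truth moments
    print(recall_at_iou(preds, gts))     # e.g. {0.3: 1.0, 0.5: 1.0, 0.7: 0.5}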

Updated: 2022-03-28