Local Correspondence Network for Weakly Supervised Temporal Sentence Grounding
IEEE Transactions on Image Processing ( IF 10.8 ) Pub Date : 2021-02-17 , DOI: 10.1109/tip.2021.3058614
Wenfei Yang , Tianzhu Zhang , Yongdong Zhang , Feng Wu

Weakly supervised temporal sentence grounding offers better scalability and practicality than fully supervised methods in real-world application scenarios. However, most existing methods cannot model the fine-grained video-text local correspondences well and lack effective supervision for correspondence learning, yielding unsatisfactory performance. To address these issues, we propose an end-to-end Local Correspondence Network (LCNet) for weakly supervised temporal sentence grounding. The proposed LCNet enjoys several merits. First, we represent video and text features in a hierarchical manner to model fine-grained video-text correspondences. Second, we design a self-supervised cycle-consistent loss as learning guidance for video-text matching. To the best of our knowledge, this is the first work to fully explore the fine-grained correspondences between video and text for temporal sentence grounding using self-supervised learning. Extensive experimental results on two benchmark datasets demonstrate that the proposed LCNet significantly outperforms existing weakly supervised methods.
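The abstract does not give the form of the cycle-consistent loss, so the following is only a minimal NumPy sketch of the general cycle-consistency idea it invokes: attend from words to video clips, attend back from the attended features to words, and penalize round trips that do not return each word to itself. All function and variable names here are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cycle_consistency_loss(text_feats, video_feats):
    """Generic cycle-consistency loss (illustrative, not the paper's exact loss).

    text_feats:  (n_words, d) word-level text features
    video_feats: (n_clips, d) clip-level video features
    """
    # Forward step: each word attends over video clips by similarity.
    fwd = softmax(text_feats @ video_feats.T)        # (n_words, n_clips)
    attended = fwd @ video_feats                     # (n_words, d)
    # Backward step: attended video features attend back to the words.
    bwd = softmax(attended @ text_feats.T)           # (n_words, n_words)
    # A consistent cycle should map each word back onto itself,
    # i.e. bwd should approximate the identity matrix.
    target = np.eye(text_feats.shape[0])
    return float(np.mean((bwd - target) ** 2))
```

With well-aligned features the backward attention is nearly the identity and the loss approaches zero; misaligned features spread attention across words and the loss grows. This self-supervised signal requires no temporal annotations, which is what makes it usable in the weakly supervised setting the paper targets.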

Updated: 2021-03-05