See More, Know More: Unsupervised Video Object Segmentation with Co-Attention Siamese Networks
arXiv - CS - Computer Vision and Pattern Recognition. Pub Date: 2020-01-19, DOI: arXiv-2001.06810
Xiankai Lu, Wenguan Wang, Chao Ma, Jianbing Shen, Ling Shao, and Fatih Porikli

We introduce a novel network, called CO-attention Siamese Network (COSNet), to address the unsupervised video object segmentation task from a holistic view. We emphasize the importance of the inherent correlation among video frames and incorporate a global co-attention mechanism to further improve state-of-the-art deep-learning-based solutions, which primarily focus on learning discriminative foreground representations over appearance and motion within short-term temporal segments. The co-attention layers in our network provide efficient stages for capturing global correlations and scene context by jointly computing co-attention responses and appending them to a joint feature space. We train COSNet with pairs of video frames, which naturally augments the training data and allows for increased learning capacity. During the segmentation stage, the co-attention model encodes useful information by processing multiple reference frames together, which is leveraged to better infer the frequently reappearing, salient foreground objects. We propose a unified and end-to-end trainable framework from which different co-attention variants can be derived for mining the rich context within videos. Our extensive experiments on three large benchmarks demonstrate that COSNet outperforms current alternatives by a large margin.
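To make the co-attention operation concrete, below is a minimal PyTorch sketch of a vanilla co-attention layer between two frame feature maps, as the abstract describes: an affinity is computed between every pair of locations across the two frames, normalized in both directions, and each frame's attention summary of the other is appended to its own features. This is an illustrative reconstruction under stated assumptions (a bilinear affinity with learnable weight W, feature maps from a shared Siamese encoder), not the authors' released implementation; all names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CoAttention(nn.Module):
    """Sketch of vanilla co-attention between two frame feature maps.

    Given features of shape (B, C, H, W) from a shared (Siamese) encoder,
    computes an affinity S = F_a^T W F_b, normalizes it with softmax in both
    directions, and appends each frame's attention summary of the other
    frame to its own features (the "joint feature space").
    """

    def __init__(self, channels: int):
        super().__init__()
        # Learnable bilinear weight W for the affinity F_a^T W F_b
        # (assumption: initialized to identity).
        self.weight = nn.Parameter(torch.eye(channels))

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor):
        b, c, h, w = feat_a.shape
        fa = feat_a.flatten(2)                      # (B, C, HW)
        fb = feat_b.flatten(2)                      # (B, C, HW)

        # Affinity between every location pair: (B, HW_a, HW_b).
        affinity = fa.transpose(1, 2) @ self.weight @ fb

        # Softmax in each direction: a attends to b, and b attends to a.
        attn_a = F.softmax(affinity, dim=2)
        attn_b = F.softmax(affinity.transpose(1, 2), dim=2)

        # Attention summaries: each frame re-expressed via the other frame.
        za = (fb @ attn_a.transpose(1, 2)).view(b, c, h, w)
        zb = (fa @ attn_b.transpose(1, 2)).view(b, c, h, w)

        # Append co-attention responses to the original features.
        return torch.cat([feat_a, za], dim=1), torch.cat([feat_b, zb], dim=1)


# Usage: two frames from the same video through a shared encoder.
coatt = CoAttention(channels=256)
fa = torch.randn(1, 256, 30, 54)
fb = torch.randn(1, 256, 30, 54)
out_a, out_b = coatt(fa, fb)   # each (1, 512, 30, 54)
```

Training on frame pairs, as the abstract notes, lets any two frames of a video serve as a training example, which is where the natural data augmentation comes from.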

Updated: 2020-01-22