Attentive Excitation and Aggregation for Bilingual Referring Image Segmentation
ACM Transactions on Intelligent Systems and Technology (IF 7.2). Pub Date: 2021-02-26. DOI: 10.1145/3446345
Qianli Zhou, Tianrui Hui, Rong Wang, Haimiao Hu, Si Liu
The goal of referring image segmentation is to identify the object that matches an input natural language expression. Previous methods support only English descriptions, even though Chinese is also widely used around the world, which limits the potential applications of this task. Therefore, we propose to extend existing datasets with Chinese descriptions and preprocessing tools for training and evaluating bilingual referring segmentation models. In addition, previous methods lack the ability to collaboratively learn channel-wise and spatial-wise cross-modal attention to align the visual and linguistic modalities well. To tackle these limitations, we propose a Linguistic Excitation module that excites image channels guided by language information, and a Linguistic Aggregation module that aggregates multimodal information based on image-language relationships. Since different levels of features from the visual backbone encode rich visual information, we also propose a Cross-Level Attentive Fusion module to fuse multilevel features gated by language information. Extensive experiments on four English and Chinese benchmarks show that our bilingual referring image segmentation model outperforms previous methods.
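The Linguistic Excitation idea described above — rescaling image feature channels with gates derived from the sentence embedding — can be illustrated with a minimal numpy sketch. The shapes, the function names, and the single linear projection here are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def linguistic_excitation(visual, lang, W):
    """Language-guided channel excitation (sketch).

    Projects a sentence embedding to one gate per visual channel
    and rescales each channel of the feature map, so channels
    relevant to the expression are emphasized.

    visual: (C, H, W) visual feature map
    lang:   (D,) sentence embedding
    W:      (C, D) hypothetical projection weights
    """
    gates = sigmoid(W @ lang)             # (C,), each gate in (0, 1)
    return visual * gates[:, None, None]  # channel-wise rescaling

# Toy usage with assumed dimensions
rng = np.random.default_rng(0)
vis = rng.standard_normal((256, 13, 13))  # backbone feature map
sent = rng.standard_normal(300)           # sentence embedding
W = rng.standard_normal((256, 300)) * 0.01
excited = linguistic_excitation(vis, sent, W)
```

Because every gate lies in (0, 1), the module can only attenuate or preserve channels, never amplify them; the spatial-wise counterpart (Linguistic Aggregation) would instead weight positions by image-language affinity.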
