Description-based person search with multi-grained matching networks
Displays (IF 3.7) | Pub Date: 2021-06-15 | DOI: 10.1016/j.displa.2021.102039
Ji Zhu, Hua Yang, Jia Wang, Wenjun Zhang

Description-based person search aims to retrieve a person from an image database based on a textual description of that person. It is a challenging task because the visual image and the textual description belong to different modalities. To fully capture the relevance between person images and textual descriptions, we propose a multi-grained framework with three branches for visual-textual matching. Specifically, in the global-grained branch, we extract global contexts from the entire images and descriptions. In the fine-grained branch, we adopt visual human parsing and linguistic parsing to split images and descriptions into semantic components related to different body parts. We design two attention mechanisms, segmentation-based and linguistics-based attention, to align visual and textual semantic components for fine-grained matching. To further exploit the spatial relations between fine-grained semantic components, we construct a body graph in the coarse-grained branch and use graph convolutional neural networks to aggregate fine-grained components into coarse-grained representations. The visual and textual representations learned by the three branches complement each other, which enhances visual-textual matching performance. Experimental results on the CUHK-PEDES dataset show that our approach performs favorably against state-of-the-art description-based person search methods.
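The coarse-grained branch aggregates part-level features over a body graph using graph convolutions. The abstract does not give the exact formulation, so the following is only an illustrative sketch: the number of body-part nodes, the chain-shaped adjacency, the feature sizes, and the symmetric-normalization scheme are all assumptions, not the paper's actual design.

```python
import numpy as np

def gcn_aggregate(H, A, W):
    """One graph-convolution step over a body graph (illustrative only).

    H : (K, d_in)     fine-grained features, one row per body-part node
    A : (K, K)        binary adjacency encoding spatial relations between parts
    W : (d_in, d_out) projection matrix (learned in a real model)
    Returns (K, d_out) coarse-grained node representations.
    """
    K = A.shape[0]
    A_hat = A + np.eye(K)                       # add self-loops
    deg = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))    # symmetric degree normalization
    # ReLU(D^-1/2 (A + I) D^-1/2 H W): each node mixes its neighbors' features
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)

# Toy example: 5 hypothetical body-part nodes connected in a chain
# (e.g. head-torso-arms-legs-feet); the real graph structure is an assumption.
rng = np.random.default_rng(0)
A = np.zeros((5, 5))
for i in range(4):
    A[i, i + 1] = A[i + 1, i] = 1.0             # chain-shaped body graph
H = rng.standard_normal((5, 16))                # fine-grained part features
W = rng.standard_normal((16, 8))
coarse = gcn_aggregate(H, A, W)
print(coarse.shape)                             # (5, 8)
```

After such an aggregation step, each node's representation summarizes its own part and its spatially adjacent parts, which is the sense in which fine-grained components are pooled into coarse-grained ones.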




Updated: 2021-06-21