IMG-Net: inner-cross-modal attentional multigranular network for description-based person re-identification
Journal of Electronic Imaging (IF 1.1), Pub Date: 2020-08-28, DOI: 10.1117/1.jei.29.4.043028
Zijie Wang, Aichun Zhu, Zhe Zheng, Jing Jin, Zhouxin Xue, Gang Hua

Abstract. Given a natural language description, description-based person re-identification aims to retrieve images of the matched person from a large-scale visual database. Due to modality heterogeneity, measuring the cross-modal similarity between images and text descriptions is challenging. Many existing approaches use a deep-learning model to encode local and global fine-grained features with a strict uniform partition strategy, which breaks part coherence and makes it difficult to capture meaningful within-part information and semantic relations among body parts. To address this issue, we propose an inner-cross-modal attentional multigranular network (IMG-Net) that incorporates inner-modal self-attention and cross-modal hard-region attention into a fine-grained model to extract multigranular semantic information. Specifically, an inner-modal self-attention module uses both spatial-wise and channel-wise information to restore the broken within-part consistency. It is followed by a multigranular feature extraction module, which extracts rich local and global visual and textual features with the help of group normalization (GN). A cross-modal hard-region attention module then produces the local visual and phrase representations. Furthermore, GN is used instead of batch normalization for more accurate batch statistics estimation. Comprehensive experiments with ablation analysis demonstrate that IMG-Net achieves state-of-the-art performance on the CUHK-PEDES dataset and significantly outperforms previous methods.
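To make the described components more concrete, the sketch below shows what an inner-modal self-attention block combining spatial-wise and channel-wise attention over a visual feature map could look like, with GroupNorm used in place of batch normalization as the abstract suggests. This is not the authors' implementation; the module structure, kernel sizes, and hyperparameters are illustrative assumptions.

```python
# Minimal sketch (assumed design, not the IMG-Net code) of an inner-modal
# self-attention block using both channel-wise and spatial-wise attention,
# normalized with GroupNorm instead of BatchNorm.
import torch
import torch.nn as nn

class InnerModalSelfAttention(nn.Module):
    def __init__(self, channels: int, groups: int = 32, reduction: int = 16):
        super().__init__()
        # Channel-wise attention: squeeze spatial dims, re-weight channels.
        self.channel_fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial-wise attention: squeeze channels, re-weight spatial positions.
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )
        # Group normalization in place of batch normalization.
        self.norm = nn.GroupNorm(groups, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width) visual feature map
        x = x * self.channel_fc(x)              # channel-wise re-weighting
        avg_map = x.mean(dim=1, keepdim=True)   # (B, 1, H, W)
        max_map = x.amax(dim=1, keepdim=True)   # (B, 1, H, W)
        spatial = self.spatial_conv(torch.cat([avg_map, max_map], dim=1))
        x = x * spatial                         # spatial-wise re-weighting
        return self.norm(x)

if __name__ == "__main__":
    feat = torch.randn(2, 256, 24, 8)           # e.g. a pedestrian feature map
    out = InnerModalSelfAttention(256)(feat)
    print(out.shape)                            # torch.Size([2, 256, 24, 8])
```

A textual counterpart would apply the same channel/position re-weighting idea to word-level features before the multigranular extraction and cross-modal hard-region attention stages described above.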
