Dual-path CNN with Max Gated block for text-based person re-identification
Image and Vision Computing (IF 4.7), Pub Date: 2021-04-22, DOI: 10.1016/j.imavis.2021.104168
Tinghuai Ma , Mingming Yang , Huan Rong , Yurong Qian , Yuan Tian , Najla Al-Nabhan

Text-based person re-identification (Re-id) is an important task in video surveillance: given a textual description, it retrieves the corresponding person's image from a large gallery. Directly matching visual content with textual descriptions is difficult because of the heterogeneity between the two modalities. On the one hand, the textual embeddings are not discriminative enough, owing to the high level of abstraction in textual descriptions. On the other hand, global average pooling (GAP) is commonly used to extract general, smoothed features, but it ignores salient local features, which matter most for cross-modal matching. With that in mind, a novel Dual-path CNN with Max Gated block (DCMG) is proposed to extract discriminative word embeddings and to make the visual-textual association focus on the salient features of both modalities. The framework is based on two deep residual CNNs jointly optimized with the cross-modal projection matching (CMPM) loss and the cross-modal projection classification (CMPC) loss, which embed the two modalities into a joint feature space. First, the pre-trained language model BERT is combined with a convolutional neural network (CNN) to learn better word embeddings for the text-to-image matching domain. Second, a global max pooling (GMP) layer is applied so that the visual and textual features concentrate on the salient parts. To further alleviate the noise in the max-pooled features, a gated block (GB) is proposed to produce an attention map that highlights the meaningful features of both modalities. Finally, extensive experiments are conducted on the benchmark dataset CUHK-PEDES, where our approach achieves a rank-1 score of 55.81% and outperforms the state-of-the-art method by 1.3%. We also evaluate our method on two generic retrieval datasets (Flickr30K and Oxford-102 Flowers) and obtain competitive performance. Code is available at https://github.com/voriarty/Dual-path-CNN-with-Max-Gated-block-for-Text-Based-Person-Re-identification
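
The following is a minimal PyTorch sketch of the two pooling/gating ideas highlighted in the abstract: replacing global average pooling with global max pooling (GMP), and re-weighting the max-pooled vector with a gated block. The channel width, reduction ratio, and exact gate structure here are illustrative assumptions, not the authors' implementation; their code is in the repository linked above.

import torch
import torch.nn as nn


class MaxGatedHead(nn.Module):
    """Pool a CNN feature map with GMP, then gate the pooled vector."""

    def __init__(self, in_channels: int = 2048, reduction: int = 16):
        super().__init__()
        # Bottleneck MLP producing a per-channel attention map in (0, 1).
        self.gate = nn.Sequential(
            nn.Linear(in_channels, in_channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(in_channels // reduction, in_channels),
            nn.Sigmoid(),
        )

    def forward(self, feat_map: torch.Tensor) -> torch.Tensor:
        # feat_map: (batch, channels, H, W) from the visual or textual CNN path.
        pooled = feat_map.amax(dim=(2, 3))   # global max pooling -> (batch, channels)
        attention = self.gate(pooled)        # gated block: channel-wise attention map
        return pooled * attention            # suppress noisy max-pooled responses


if __name__ == "__main__":
    head = MaxGatedHead(in_channels=2048)
    dummy = torch.randn(4, 2048, 24, 8)      # e.g. a ResNet feature map for a person crop
    print(head(dummy).shape)                 # torch.Size([4, 2048])

The intuition, following the abstract, is that GMP keeps the strongest local responses (a distinctive bag, a shirt color) rather than averaging them away, and the sigmoid gate can then down-weight channels whose maxima are driven by background noise.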




Updated: 2021-04-28