Conditional Feature Learning Based Transformer for Text-Based Person Search
IEEE Transactions on Image Processing (IF 10.8) | Pub Date: 2022-09-14 | DOI: 10.1109/tip.2022.3205216
Chenyang Gao 1 , Guanyu Cai 2 , Xinyang Jiang 3 , Feng Zheng 1 , Jun Zhang 4 , Yifei Gong 4 , Fangzhou Lin 5 , Xing Sun 4 , Xiang Bai 6

Text-based person search aims at retrieving the target person in an image gallery using a descriptive sentence of that person. The core of this task is to calculate a similarity score between the pedestrian image and the description, which requires inferring the complex latent correspondence between image sub-regions and textual phrases at different scales. The Transformer is an intuitive way to model this complex alignment through its self-attention mechanism. Most previous Transformer-based methods simply concatenate image region features and text features as input and learn a cross-modal representation in a brute-force manner. Such weakly supervised learning approaches fail to explicitly build alignment between image region features and text features, causing an inferior feature distribution. In this paper, we present CFLT, a Conditional Feature Learning based Transformer. It maps the sub-regions and phrases into a unified latent space and explicitly aligns them by constructing conditional embeddings, where the features of data from one modality are dynamically adjusted based on the data from the other modality. The output of our CFLT is a set of similarity scores for each sub-region or phrase rather than a cross-modal representation. Furthermore, we propose a simple and effective multi-modal re-ranking method named Re-ranking scheme by Visual Conditional Feature (RVCF). Benefiting from the visual conditional feature and the better feature distribution in our CFLT, the proposed RVCF achieves a significant performance improvement. Experimental results show that our CFLT outperforms the state-of-the-art methods by 7.03% in terms of top-1 accuracy and 5.01% in terms of top-5 accuracy on the text-based person search dataset.
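The abstract's core idea — conditioning the features of one modality on the other and scoring per sub-region rather than fusing into a single cross-modal vector — can be illustrated with a minimal sketch. This is not the paper's actual CFLT architecture; the FiLM-style gate/shift modulation, the random weights, and the max-then-mean aggregation below are all illustrative assumptions.

```python
import numpy as np

def conditional_embed(features, condition, W_gate, W_shift):
    # Hypothetical conditional embedding: scale and shift the features of
    # one modality using a pooled embedding of the other modality.
    gate = 1.0 / (1.0 + np.exp(-(condition @ W_gate)))  # sigmoid gate
    shift = condition @ W_shift
    return features * gate + shift

def l2norm(x):
    # Row-wise L2 normalization so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d = 8
region_feats = rng.normal(size=(6, d))  # 6 image sub-region features
phrase_feats = rng.normal(size=(4, d))  # 4 textual phrase features

# Condition each image sub-region on the pooled text embedding.
text_ctx = phrase_feats.mean(axis=0)
W_g, W_s = rng.normal(size=(d, d)), rng.normal(size=(d, d))
cond_regions = conditional_embed(region_feats, text_ctx, W_g, W_s)

# Output is a set of per-(sub-region, phrase) similarity scores,
# not a single fused cross-modal representation.
scores = l2norm(cond_regions) @ l2norm(phrase_feats).T  # shape (6, 4)

# One illustrative way to aggregate: best-matching phrase per region,
# averaged over regions, to rank gallery images for retrieval.
image_text_score = scores.max(axis=1).mean()
```

In a retrieval setting, `image_text_score` would be computed for every gallery image and used to rank candidates; the conditional (text-adjusted) visual features are also what a re-ranking step such as RVCF could reuse.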

Updated: 2024-08-28