Investigating the Vision Transformer Model for Image Retrieval Tasks
arXiv - CS - Information Retrieval. Pub Date: 2021-01-11, DOI: arxiv-2101.03771
Socratis Gkelios, Yiannis Boutalis, Savvas A. Chatzichristofis

This paper introduces a plug-and-play descriptor that can be effectively adopted for image retrieval tasks without prior initialization or preparation. The description method utilizes the recently proposed Vision Transformer network and requires no training data to adjust its parameters. In image retrieval, handcrafted global and local descriptors have, over recent years, been very successfully replaced by Convolutional Neural Network (CNN)-based methods. However, the experimental evaluation conducted in this paper on several benchmark datasets against 36 state-of-the-art descriptors from the literature demonstrates that a neural network containing no convolutional layers, such as the Vision Transformer, can shape a global descriptor and achieve competitive results. As fine-tuning is not required, the presented methodology's low complexity encourages adoption of the architecture as an image retrieval baseline model, replacing the traditional and well-adopted CNN-based approaches and inaugurating a new era in image retrieval.
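To make the idea concrete, below is a minimal sketch of such a training-free ViT global descriptor. The timm library, the vit_base_patch16_224 checkpoint, and the pooling behavior are illustrative assumptions, not the authors' exact pipeline; the paper's core point is only that a pretrained Vision Transformer can serve as an off-the-shelf descriptor with no fine-tuning.

```python
# Sketch: a pretrained ViT as a plug-and-play global image descriptor.
# Assumes the timm library is installed; model choice is illustrative.
import torch
import timm
from timm.data import resolve_data_config, create_transform
from PIL import Image

# num_classes=0 strips the classification head, so the forward pass
# returns the pooled ViT embedding instead of class logits.
model = timm.create_model('vit_base_patch16_224', pretrained=True,
                          num_classes=0)
model.eval()
transform = create_transform(**resolve_data_config({}, model=model))

@torch.no_grad()
def describe(path: str) -> torch.Tensor:
    """Map an image file to an L2-normalized global descriptor."""
    x = transform(Image.open(path).convert('RGB')).unsqueeze(0)
    feat = model(x)  # shape [1, 768] for this checkpoint
    return torch.nn.functional.normalize(feat, dim=-1).squeeze(0)

def rank(query: torch.Tensor, gallery: dict) -> list:
    """Rank gallery paths by cosine similarity to the query descriptor.

    gallery: {image_path: descriptor}, built beforehand with describe().
    Descriptors are unit vectors, so the dot product is cosine similarity.
    """
    sims = {p: float(query @ d) for p, d in gallery.items()}
    return sorted(sims, key=sims.get, reverse=True)
```

Since the descriptors are L2-normalized, ranking by dot product is equivalent to cosine-similarity retrieval, the standard protocol for global-descriptor benchmarks; no parameter of the network is updated at any point.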

Updated: 2021-01-12