Picture it in your mind: generating high level visual representations from textual descriptions
Information Retrieval Journal (IF 1.7) Pub Date: 2017-10-14, DOI: 10.1007/s10791-017-9318-6
Fabio Carrara , Andrea Esuli , Tiziano Fagni , Fabrizio Falchi , Alejandro Moreo Fernández

In this paper we tackle the problem of image search when the query is a short textual description of the image the user is looking for. We implement the actual search process as a similarity search in a visual feature space, by learning to translate a textual query into a visual representation. Searching in the visual feature space has the advantage that any update to the translation model does not require reprocessing the (typically huge) image collection on which the search is performed. We propose several neural network models of increasing complexity that learn to generate, from a short descriptive text, a high-level visual representation in a visual feature space such as the pool5 layer of ResNet-152 or the fc6–fc7 layers of an AlexNet trained on the ILSVRC12 and Places databases. The Text2Vis models we explore include (1) a relatively simple regressor network relying on a bag-of-words representation of the textual descriptors, (2) a deep recurrent network that is sensitive to word order, and (3) a wide-and-deep model that combines a stacked LSTM deep network with a wide regressor network. We compare the proposed models with other search strategies, including textual search methods that exploit state-of-the-art caption generation models to index the image collection.
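To make the retrieval scheme concrete, the sketch below implements the simplest variant, (1), as a bag-of-words regressor in PyTorch, together with the similarity search it enables. This is a minimal illustration, not the paper's implementation: the vocabulary size, hidden width, cosine-similarity ranking, and the names BowText2Vis and search are assumptions; only the target dimensionality of 2048 is grounded, matching the pool5 output of ResNet-152.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_SIZE = 10_000   # assumed vocabulary size (not specified in the abstract)
VISUAL_DIM = 2048     # pool5 output dimensionality of ResNet-152

class BowText2Vis(nn.Module):
    """Variant (1): a simple regressor mapping a bag-of-words query
    vector to a point in the visual feature space (hypothetical sketch)."""
    def __init__(self, vocab_size=VOCAB_SIZE, hidden=1024, visual_dim=VISUAL_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vocab_size, hidden),
            nn.ReLU(),
            nn.Linear(hidden, visual_dim),
        )

    def forward(self, bow):      # bow: (batch, vocab_size)
        return self.net(bow)     # predicted visual features: (batch, visual_dim)

def search(query_bow, image_features, model, k=10):
    """Translate the textual query into the visual space, then rank
    precomputed image features by cosine similarity (an assumed metric)."""
    with torch.no_grad():
        q = F.normalize(model(query_bow), dim=1)    # (1, visual_dim)
        feats = F.normalize(image_features, dim=1)  # (N, visual_dim)
        scores = (feats @ q.t()).squeeze(1)         # (N,) similarity per image
        return scores.topk(k).indices               # indices of the top-k images
```

Note how the advantage claimed above shows up in this sketch: image_features is extracted from the collection once, while model can be retrained or replaced without touching it. Variants (2) and (3) would replace the bag-of-words encoder with a stacked LSTM over the word sequence (concatenated with the wide regressor path in the wide-and-deep case), keeping the same search routine.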
