Clothes Image Caption Generation with Attribute Detection and Visual Attention Model
Pattern Recognition Letters (IF 5.1), Pub Date: 2020-12-10, DOI: 10.1016/j.patrec.2020.12.001
Xianrui Li, Zhiling Ye, Zhao Zhang, Mingbo Zhao

Fashion is a multi-billion-dollar industry with direct social, cultural, and economic implications in the real world. While computer vision has demonstrated remarkable success in fashion-domain applications, natural language processing has also begun to contribute to the area, as it can build a connection between clothes images and human semantic understanding. An elementary task in combining image and language understanding is generating a natural language sentence that accurately summarizes the contents of a clothes image. In this paper, we develop a joint attribute detection and visual attention framework for clothes image captioning. Specifically, in order to involve more clothing attributes in learning, we first utilize a pre-trained Convolutional Neural Network (CNN) to learn features that characterize richer clothing-attribute information. Based on the learned features, we then adopt an encoder/decoder framework: the clothes features are first encoded and then fed into a language Long Short-Term Memory (LSTM) model that decodes the clothes descriptions. The method greatly enhances the performance of clothes image captioning and reduces misleading attention. Extensive simulations based on real-world data verify the effectiveness of the proposed method.
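The abstract outlines a CNN-encoder / visual-attention / LSTM-decoder pipeline. Below is a minimal PyTorch sketch of that kind of architecture, assuming a ResNet-50 backbone, additive soft attention, and a single-layer LSTM decoder. It is illustrative only: it omits the attribute-detection branch, and all module names, layer sizes, and the vocabulary size are assumptions rather than the authors' implementation.

```python
# Illustrative sketch of a CNN encoder + soft attention + LSTM caption decoder.
# Not the paper's released code; hyperparameters and module names are assumed.
import torch
import torch.nn as nn
import torchvision.models as models


class CNNEncoder(nn.Module):
    """Encode an image into a grid of feature vectors with a pre-trained CNN."""

    def __init__(self):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        # Drop the average-pool and classification head; keep spatial feature maps.
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])

    def forward(self, images):                     # images: (B, 3, 224, 224)
        feats = self.backbone(images)              # (B, 2048, 7, 7)
        B, C, H, W = feats.shape
        return feats.view(B, C, H * W).permute(0, 2, 1)  # (B, 49, 2048)


class SoftAttention(nn.Module):
    """Additive attention over spatial features, conditioned on the LSTM state."""

    def __init__(self, feat_dim, hidden_dim, attn_dim=512):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):              # feats: (B, L, F), hidden: (B, H)
        e = self.score(torch.tanh(self.feat_proj(feats) +
                                  self.hidden_proj(hidden).unsqueeze(1)))  # (B, L, 1)
        alpha = torch.softmax(e, dim=1)            # attention weights over regions
        context = (alpha * feats).sum(dim=1)       # (B, F) attended image feature
        return context, alpha.squeeze(-1)


class LSTMDecoder(nn.Module):
    """Language LSTM that emits one word per step while attending to image regions."""

    def __init__(self, vocab_size, feat_dim=2048, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attention = SoftAttention(feat_dim, hidden_dim)
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, captions):            # captions: (B, T) word indices
        B, T = captions.shape
        h = feats.new_zeros(B, self.lstm.hidden_size)
        c = feats.new_zeros(B, self.lstm.hidden_size)
        logits = []
        for t in range(T):
            context, _ = self.attention(feats, h)  # re-weight regions at every step
            x = torch.cat([self.embed(captions[:, t]), context], dim=1)
            h, c = self.lstm(x, (h, c))
            logits.append(self.fc(h))
        return torch.stack(logits, dim=1)          # (B, T, vocab_size)


if __name__ == "__main__":
    encoder, decoder = CNNEncoder(), LSTMDecoder(vocab_size=1000)
    images = torch.randn(2, 3, 224, 224)
    captions = torch.randint(0, 1000, (2, 12))
    scores = decoder(encoder(images), captions)
    print(scores.shape)                            # torch.Size([2, 12, 1000])
```

Keeping the spatial 7×7 feature grid, rather than a pooled vector, is what lets the attention module re-weight image regions at every decoding step; in the full model described in the abstract, the attribute-aware features would further constrain that attention to reduce misleading focus.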




Updated: 2020-12-10