Comparative evaluation of CNN architectures for Image Caption Generation
arXiv - CS - Artificial Intelligence Pub Date : 2021-02-23 , DOI: arxiv-2102.11506
Sulabh Katiyar, Samir Kumar Borgohain

Aided by recent advances in Deep Learning, Image Caption Generation has seen tremendous progress over the last few years. Most methods use transfer learning to extract visual information, in the form of image features, with the help of pre-trained Convolutional Neural Network models, followed by transformation of the visual information by a Caption Generator module to produce the output sentences. Different methods have used different Convolutional Neural Network architectures and, to the best of our knowledge, there is no systematic study comparing the relative efficacy of different Convolutional Neural Network architectures for extracting the visual information. In this work, we evaluate 17 different Convolutional Neural Networks on two popular Image Caption Generation frameworks: the first based on the Neural Image Caption (NIC) generation model and the second based on the Soft-Attention framework. We observe that the model complexity of a Convolutional Neural Network, as measured by the number of parameters, and its accuracy on the Object Recognition task do not necessarily correlate with its efficacy as a feature extractor for the Image Caption Generation task.
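The two-stage pipeline the abstract describes (a pre-trained CNN encoder feeding a caption-generator decoder) can be sketched schematically. In the toy sketch below, both stages are illustrative stand-ins, not the paper's models: a real encoder would be a pre-trained CNN such as ResNet or VGG, and a real NIC-style decoder would be an LSTM conditioned on the image features and previously emitted tokens.

```python
# Schematic of the CNN-encoder / caption-generator pipeline.
# Both functions are toy stand-ins that only illustrate the data flow.

def cnn_encode(image):
    """Stand-in for a pre-trained CNN: maps an image to a feature vector.
    Here we just take the mean intensity per channel as a mock feature."""
    return [sum(channel) / len(channel) for channel in image]

def caption_generator(features, vocab, max_len=10):
    """Stand-in for an NIC-style decoder: emits tokens until <end>.
    A real decoder would condition an RNN/LSTM on `features` and the
    tokens generated so far; here we simply walk through the vocab."""
    caption = []
    for t in range(max_len):
        token = vocab[t % len(vocab)]
        if token == "<end>":
            break
        caption.append(token)
    return caption

# A tiny fake "image": 2 channels of 3 pixels each.
image = [[0.1, 0.5, 0.9], [0.2, 0.4, 0.6]]
features = cnn_encode(image)
caption = caption_generator(features, ["a", "dog", "runs", "<end>"])
print(" ".join(caption))  # a dog runs
```

Swapping in a different CNN architecture changes only `cnn_encode`; the paper's comparison holds the caption-generator frameworks (NIC and Soft-Attention) fixed while varying that encoder.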

Updated: 2021-02-24