Remote sensing image captioning via Variational Autoencoder and Reinforcement Learning, Knowledge-Based Systems


Knowledge-Based Systems (IF 7.2), Pub Date: 2020-04-23, DOI: 10.1016/j.knosys.2020.105920
Xiangqing Shen, Bing Liu, Yong Zhou, Jiaqi Zhao, Mingming Liu

Image captioning, i.e., generating a natural-language description of the semantic content of a given image, is an essential task for machine understanding of images. Remote sensing image captioning is a subfield of this task. Most current remote sensing image captioning models suffer from overfitting and fail to utilize the semantic information in images. To address this, we propose a Variational Autoencoder and Reinforcement Learning based Two-stage Multi-task Learning Model (VRTMM) for remote sensing image captioning. In the first stage, we fine-tune the CNN jointly with the Variational Autoencoder. In the second stage, a Transformer generates the text description using both spatial and semantic features. Reinforcement Learning is then applied to improve the quality of the generated sentences. Our model surpasses the previous state-of-the-art results by a large margin on all seven scores on the Remote Sensing Image Caption Dataset. The experimental results indicate that our model is effective for remote sensing image captioning and achieves a new state-of-the-art result.
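The abstract does not state the loss used in the first stage. As background only, the standard Variational Autoencoder objective (negative ELBO) combines a reconstruction term with a KL divergence between the approximate posterior and a standard normal prior. The sketch below is a minimal illustration of that textbook objective, not the authors' implementation: the function names, the squared-error reconstruction term, and the `beta` weight are assumptions made for illustration.

```python
import math


def gaussian_kl(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, logvar))


def vae_loss(x, x_recon, mu, logvar, beta=1.0):
    """Negative ELBO with an assumed squared-error reconstruction term.

    beta is a hypothetical weight on the KL term; beta=1.0 gives the standard ELBO.
    """
    recon = sum((a - b) ** 2 for a, b in zip(x, x_recon))
    return recon + beta * gaussian_kl(mu, logvar)
```

With a perfect reconstruction and a posterior equal to the prior (`mu = 0`, `logvar = 0`), both terms vanish and the loss is zero. Any deviation in either term increases it.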


Updated: 2020-04-23
