Image captioning via proximal policy optimization
Image and Vision Computing (IF 4.2), Pub Date: 2021-02-08, DOI: 10.1016/j.imavis.2021.104126
Le Zhang, Yanshuo Zhang, Xin Zhao, Zexiao Zou

Image captioning is the task of generating natural-language descriptions of images. Training typically consists of two phases: first minimizing the XE (cross-entropy) loss, and then optimizing CIDEr scores with RL (reinforcement learning). Although there are many innovations in neural architectures, fewer works address the RL phase. Motivated by the recent state-of-the-art X-Transformer architecture [Pan et al., CVPR 2020], we apply PPO (Proximal Policy Optimization) to it to obtain a further improvement. However, naively applying a vanilla policy gradient objective with the clipping form of PPO does not improve the result, so we introduce several modifications. We show that PPO is capable of enforcing trust-region constraints effectively. We also observe experimentally that performance decreases when PPO is combined with dropout regularization, and we analyze the likely cause in terms of the KL-divergence between RL policies. The baseline adopted in the RL policy gradient estimator is usually sentence-level, so all words in the same sentence share the same baseline value in the gradient estimator. We instead use a word-level baseline obtained via Monte-Carlo estimation, so different words can have different baseline values. With these modifications, by fine-tuning a pre-trained X-Transformer we train a single model achieving a competitive result of 133.3% CIDEr on the MSCOCO Karpathy test split. Source code is available at https://github.com/lezhang-thu/xtransformer-ppo.
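As a concrete illustration of the objective described above, below is a minimal sketch (in PyTorch, not the authors' released code) of a PPO-clipped surrogate loss for sampled captions with a word-level baseline; the function name, tensor shapes, and clipping epsilon are illustrative assumptions rather than details taken from the paper.

import torch

def ppo_caption_loss(logp_new, logp_old, reward, word_baseline, mask, eps=0.1):
    # logp_new:      (B, T) log-probs of sampled caption words under the current policy
    # logp_old:      (B, T) log-probs of the same words under the old (behavior) policy
    # reward:        (B,)   sequence-level CIDEr reward of each sampled caption
    # word_baseline: (B, T) per-word baseline, e.g. a Monte-Carlo estimate (assumed helper input)
    # mask:          (B, T) 1 for real words, 0 for padding after the end token

    # Word-level advantage: the same sequence reward, but each word subtracts its own baseline.
    advantage = reward.unsqueeze(1) - word_baseline          # (B, T)

    # Per-word probability ratio between the current and old policies.
    ratio = torch.exp(logp_new - logp_old)                   # (B, T)

    # PPO clipped surrogate: take the pessimistic minimum of the unclipped and clipped terms.
    surr1 = ratio * advantage
    surr2 = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    loss = -(torch.min(surr1, surr2) * mask).sum() / mask.sum()
    return loss

For comparison, the common SCST-style setup subtracts a single sentence-level baseline (e.g., the CIDEr score of a greedily decoded caption) from every word of the sampled caption, whereas a word-level baseline lets each position subtract its own estimate.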




Updated: 2021-02-16