Dual Position Relationship Transformer for Image Captioning
Big Data (IF 4.6) Pub Date: 2022-12-07, DOI: 10.1089/big.2021.0262
Yaohan Wang, Wenhua Qian, Rencan Nie, Dan Xu, Jinde Cao, Pyoungwon Kim

Employing feature vectors extracted from a target detector has been shown to be effective in improving the performance of image captioning. However, existing frameworks extract insufficient information, such as positional relationships, even though judging the relationships between objects is essential. To fill this gap, we present a dual position relationship transformer (DPR) for image captioning. The architecture improves the image information extraction and description encoding steps: it first calculates the relative position (RP) and absolute position (AP) between objects, then integrates the dual position relationship information into self-attention. Specifically, a convolutional neural network (CNN) and Faster R-CNN are applied to extract image features and detect targets, and the RP and AP of the generated object boxes are then calculated. The former is expressed in coordinate form, and the latter is computed by sinusoidal encoding. In addition, to better model the sequential and temporal relationships in the description, DPR adopts long short-term memory (LSTM) to encode the text vector. We conduct extensive experiments on the Microsoft COCO: Common Objects in Context (MSCOCO) image captioning data set, which show that our method achieves superior performance: Consensus-based Image Description Evaluation (CIDEr) increases to 114.6 after 30 training epochs, and the model runs twice as fast as other competitive methods. An ablation study verifies the effectiveness of the proposed modules.
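To make the two positional signals concrete, the following is a minimal Python sketch of how the RP and AP described above could be computed from detector boxes. The specific RP features (log-scaled center offsets and width/height ratios) and the choice to encode box centers for AP are assumptions drawn from common practice in relation-aware captioning models, not the authors' released code.

import numpy as np

def relative_position(box_a, box_b):
    # Relative position (RP) in coordinate form between two boxes given as
    # (x_min, y_min, x_max, y_max). The log-scaled center offsets and
    # width/height ratios are an assumed feature choice.
    xa, ya = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    xb, yb = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    wa, ha = box_a[2] - box_a[0], box_a[3] - box_a[1]
    wb, hb = box_b[2] - box_b[0], box_b[3] - box_b[1]
    return np.array([
        np.log(abs(xb - xa) / wa + 1e-6),  # horizontal offset, scale-normalized
        np.log(abs(yb - ya) / ha + 1e-6),  # vertical offset, scale-normalized
        np.log(wb / wa),                   # relative width
        np.log(hb / ha),                   # relative height
    ])

def sinusoidal_encoding(position, d_model=64):
    # Absolute position (AP) via the standard transformer sinusoidal
    # encoding; d_model must be even.
    i = np.arange(d_model // 2)
    angles = position / (10000.0 ** (2 * i / d_model))
    enc = np.empty(d_model)
    enc[0::2] = np.sin(angles)
    enc[1::2] = np.cos(angles)
    return enc

# Hypothetical detector outputs, for illustration only.
box_a = (30.0, 40.0, 120.0, 200.0)
box_b = (100.0, 60.0, 220.0, 210.0)
rp = relative_position(box_a, box_b)                 # 4-d RP feature
ap = sinusoidal_encoding((box_a[0] + box_a[2]) / 2)  # AP of box_a's center x

In a DPR-style model, vectors like these would be projected and injected into the self-attention computation; the exact integration is specified in the paper itself.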

Updated: 2022-12-09