Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA
arXiv - CS - Computation and Language. Pub Date: 2019-11-14, DOI: arxiv-1911.06258
Ronghang Hu, Amanpreet Singh, Trevor Darrell, Marcus Rohrbach

Many visual scenes contain text that carries crucial information, so understanding text in images is essential for downstream reasoning tasks. For example, a "deep water" label on a warning sign alerts people to a danger in the scene. Recent work has explored the TextVQA task, which requires reading and understanding text in images to answer a question. However, existing approaches to TextVQA are mostly based on custom pairwise fusion mechanisms between two modalities and are restricted to a single prediction step by casting TextVQA as a classification task. In this work, we propose a novel model for the TextVQA task based on a multimodal transformer architecture accompanied by a rich representation for text in images. Our model fuses the different modalities homogeneously by embedding them into a common semantic space, where self-attention models both inter- and intra-modality context. Furthermore, it enables iterative answer decoding with a dynamic pointer network, allowing the model to form an answer through multi-step prediction instead of one-step classification. Our model outperforms existing approaches on three benchmark datasets for the TextVQA task by a large margin.
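To make the two core ideas concrete, here is a minimal, hypothetical PyTorch sketch (not the authors' released implementation): question words, visual objects, and OCR tokens are projected into a common semantic space and fused by transformer self-attention, and answers are decoded iteratively, with a dynamic pointer letting each step select either a fixed-vocabulary word or an OCR token from the image. All dimensions, layer counts, and names (PointerAugmentedVQA, D_OBJ, D_OCR, etc.) are illustrative assumptions.

```python
import torch
import torch.nn as nn

D = 256                    # common embedding dim (assumed)
D_OBJ, D_OCR = 2048, 300   # raw visual-object / OCR feature dims (assumed)
VOCAB = 5000               # fixed answer vocabulary size (assumed)
STEPS = 12                 # maximum decoding steps (assumed)

class PointerAugmentedVQA(nn.Module):
    def __init__(self):
        super().__init__()
        self.txt_emb = nn.Embedding(30522, D)   # question word ids -> common space
        self.obj_proj = nn.Linear(D_OBJ, D)     # visual-object features -> common space
        self.ocr_proj = nn.Linear(D_OCR, D)     # OCR-token features -> common space
        layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.dec_emb = nn.Embedding(VOCAB, D)   # embeds previously predicted vocab words
        self.vocab_head = nn.Linear(D, VOCAB)   # scores over the fixed vocabulary

    def forward(self, q_ids, obj_feats, ocr_feats):
        # Embed every modality into the common space and fuse them as one sequence,
        # so self-attention models inter- and intra-modality context jointly.
        q = self.txt_emb(q_ids)                 # (B, Lq, D)
        o = self.obj_proj(obj_feats)            # (B, Lo, D)
        t = self.ocr_proj(ocr_feats)            # (B, Lt, D)
        seq = torch.cat([q, o, t], dim=1)
        lq, lo, lt = q.size(1), o.size(1), t.size(1)

        answers = []
        prev = torch.zeros(q_ids.size(0), 1, D)     # <begin> slot for step 1
        for _ in range(STEPS):
            # Re-encode the multimodal sequence together with the decoding slots.
            h = self.encoder(torch.cat([seq, prev], dim=1))
            step = h[:, -1]                          # state of the current decoding slot
            ocr_enc = h[:, lq + lo : lq + lo + lt]   # context-updated OCR tokens

            # Dynamic pointer: score vocabulary words and OCR tokens in one space.
            vocab_scores = self.vocab_head(step)                     # (B, VOCAB)
            ptr_scores = torch.einsum("bd,bld->bl", step, ocr_enc)   # (B, Lt)
            idx = torch.cat([vocab_scores, ptr_scores], dim=1).argmax(dim=1)
            answers.append(idx)

            # Feed the prediction back: vocab words via the embedding table,
            # pointed OCR tokens via their encoded features.
            is_vocab = (idx < VOCAB).unsqueeze(1)
            nxt = torch.where(
                is_vocab,
                self.dec_emb(idx.clamp(max=VOCAB - 1)),
                ocr_enc[torch.arange(idx.size(0)), (idx - VOCAB).clamp(min=0)],
            )
            prev = torch.cat([prev, nxt.unsqueeze(1)], dim=1)
        return torch.stack(answers, dim=1)  # (B, STEPS): vocab id or VOCAB + OCR index
```

In a full model, the decoding slots would be causally masked and trained with teacher forcing; this sketch simply decodes greedily for a fixed number of steps to keep the multi-step prediction flow visible.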

Updated: 2020-03-26