Multimodal grid features and cell pointers for Scene Text Visual Question Answering
Pattern Recognition Letters (IF 5.1), Pub Date: 2021-07-20, DOI: 10.1016/j.patrec.2021.06.026
Lluís Gómez, Ali Furkan Biten, Rubén Tito, Andrés Mafla, Marçal Rusiñol, Ernest Valveny, Dimosthenis Karatzas

This paper presents a new model for the task of scene text visual question answering. In this task, questions about a given image can only be answered by reading and understanding the scene text. Current state-of-the-art models for this task use a dual attention mechanism in which one attention module attends to visual features while the other attends to textual features. A possible issue with this design is that it makes it difficult for the model to reason jointly about both modalities. To address this problem, we propose a new model based on a single attention mechanism that attends to multimodal features conditioned on the question. The output weights of this attention module over a grid of multimodal spatial features are interpreted as the probability that a given spatial location of the image contains the answer text to the question. Our experiments demonstrate competitive performance on two standard datasets with a model that is 5× faster than previous methods at inference time. Furthermore, we provide a novel analysis of the ST-VQA dataset based on a human performance study. Supplementary material, code, and data are made available through this link.
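To make the described mechanism concrete, below is a minimal sketch (not the authors' released code) of a single question-conditioned attention over a grid of multimodal features, where the softmax over grid cells acts as a pointer to the cell holding the answer text. All module names and feature dimensions are illustrative assumptions.

```python
# A hypothetical sketch of a question-conditioned "cell pointer" over a
# multimodal grid; dimensions and layer choices are assumptions, not the
# authors' implementation.
import torch
import torch.nn as nn


class MultimodalCellPointer(nn.Module):
    def __init__(self, vis_dim=2048, txt_dim=300, q_dim=512, hid_dim=512):
        super().__init__()
        # Fuse concatenated visual + textual grid features into a shared space.
        self.fuse = nn.Linear(vis_dim + txt_dim, hid_dim)
        self.q_proj = nn.Linear(q_dim, hid_dim)
        self.score = nn.Linear(hid_dim, 1)

    def forward(self, vis_grid, txt_grid, question):
        # vis_grid: (B, H*W, vis_dim)  visual features per grid cell
        # txt_grid: (B, H*W, txt_dim)  OCR-token embeddings placed on the grid
        # question: (B, q_dim)         encoded question
        fused = torch.tanh(self.fuse(torch.cat([vis_grid, txt_grid], dim=-1)))
        q = self.q_proj(question).unsqueeze(1)      # (B, 1, hid_dim)
        logits = self.score(fused * q).squeeze(-1)  # (B, H*W)
        # Softmax over grid cells: probability that each cell holds the answer.
        return logits.softmax(dim=-1)


# Example with a 28x28 grid; argmax over the output points to the answer cell.
model = MultimodalCellPointer()
p = model(torch.randn(2, 784, 2048), torch.randn(2, 784, 300),
          torch.randn(2, 512))
print(p.shape)  # torch.Size([2, 784])
```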



Updated: 2021-07-20