Visual Question Answering for Cultural Heritage
arXiv - CS - Computation and Language Pub Date : 2020-03-22 , DOI: arxiv-2003.09853
Pietro Bongini, Federico Becattini, Andrew D. Bagdanov, Alberto Del Bimbo

Technology and the fruition of cultural heritage are becoming increasingly entwined, especially with the advent of smart audio guides, virtual and augmented reality, and interactive installations. Machine learning and computer vision are important components of this ongoing integration, enabling new modes of interaction between user and museum. Nonetheless, the most frequent way of interacting with paintings and statues is still taking pictures. Yet images alone can only convey the aesthetics of an artwork; they lack the information that is often required to fully understand and appreciate it. Usually this additional knowledge comes both from the artwork itself (and therefore from the image depicting it) and from an external source of knowledge, such as an information sheet. While the former can be inferred by computer vision algorithms, the latter needs more structured data to pair visual content with relevant information. Regardless of its source, this information must still be effectively transmitted to the user. A popular emerging trend in computer vision is Visual Question Answering (VQA), in which users interact with a neural network by posing questions in natural language and receiving answers about the visual content. We believe that this will be the evolution of smart audio guides for museum visits and of simple image browsing on personal smartphones. It will turn the classic audio guide into a smart personal instructor with which the visitor can interact by asking for explanations focused on specific interests. The advantages are twofold: on the one hand, the cognitive burden on the visitor decreases, since the flow of information is limited to what the user actually wants to hear; on the other hand, it offers the most natural way of interacting with a guide, favoring engagement.


