Learning Multimodal Affinities for Textual Editing in Images
ACM Transactions on Graphics (IF 7.8), Pub Date: 2021-07-15, DOI: 10.1145/3451340
Or Perel, Oron Anschel, Omri Ben-Eliezer, Shai Mazor, Hadar Averbuch-Elor
As cameras become ubiquitous in our daily routines, images of documents are increasingly abundant. Unlike natural images, which capture physical objects, document images contain a significant amount of text with critical semantics and complicated layouts. In this work, we devise a generic unsupervised technique to learn multimodal affinities between textual entities in a document image, considering their visual style, the content of their underlying text, and their geometric context within the image. We then use these learned affinities to automatically cluster the textual entities in the image into semantic groups. The core of our approach is a deep optimization scheme, dedicated to the user-provided image, that detects and leverages reliable pairwise connections in the multimodal representation of the textual elements to learn the affinities properly. We show that our technique can operate on highly varied images spanning a wide range of documents, and we demonstrate its applicability to editing operations that manipulate the content, appearance, and geometry of the image.
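The pipeline the abstract describes (multimodal features per textual entity → pairwise affinities → clustering into semantic groups) can be sketched as follows. This is an illustrative stand-in, not the paper's method: the toy feature vectors, the fixed weighted concatenation, and the threshold-based union-find grouping are all simplifying assumptions, whereas the paper learns the affinities with a per-image deep optimization.

```python
import numpy as np

def multimodal_affinity(entities, weights=(1.0, 1.0, 1.0)):
    """Cosine affinity between textual entities, combining visual-style,
    text-content, and geometric feature vectors by weighted concatenation.
    Illustrative only: the paper learns affinities per image, rather than
    using a fixed weighted sum of precomputed features."""
    w_v, w_t, w_g = weights
    feats = [np.concatenate([w_v * e["visual"], w_t * e["text"], w_g * e["geom"]])
             for e in entities]
    X = np.stack(feats)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    return X @ X.T  # symmetric pairwise affinity matrix in [-1, 1]

def cluster_by_affinity(aff, threshold=0.9):
    """Group entities whose pairwise affinity exceeds a threshold, using
    union-find as a stand-in for the clustering step."""
    n = aff.shape[0]
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i
    for i in range(n):
        for j in range(i + 1, n):
            if aff[i, j] >= threshold:
                parent[find(i)] = find(j)
    roots, labels = {}, []
    for i in range(n):
        labels.append(roots.setdefault(find(i), len(roots)))
    return labels
```

For example, two entities with similar style, text, and position receive the same label, while a dissimilar one is placed in its own group; the resulting clusters can then drive group-level edits to content, appearance, or geometry.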
