Enhancing Social Relation Inference with Concise Interaction Graph and Discriminative Scene Representation,arXiv - CS - Computer Vision and Pattern Recognition



arXiv - CS - Computer Vision and Pattern Recognition. Pub Date: 2021-07-30, DOI: arxiv-2107.14425
Xiaotian Yu, Hanling Yi, Yi Yu, Ling Xing, Shiliang Zhang, Xiaoyu Wang

There has been a recent surge of research interest in the problem of inferring social relations from images. Existing works classify social relations mainly by building complicated graphs of human interactions, or by learning foreground and/or background information about persons and objects, but they ignore the holistic scene context. The holistic scene refers to the functionality of the place shown in an image, such as a dining room, playground, or office. In this paper, mimicking how humans understand images, we propose PRactical Inference in Social rElation (PRISE), an approach that concisely learns interactive features of persons and discriminative features of holistic scenes. Technically, we develop a simple and fast relational graph convolutional network to capture the interactive features of all persons in one image. To learn the holistic scene feature, we design a contrastive learning task based on image scene classification. To further boost performance in social relation inference, we collect and distribute a new large-scale dataset of about 240 thousand unlabeled images. Extensive experimental results show that our learning framework significantly outperforms state-of-the-art methods; e.g., PRISE achieves a 6.8% improvement for domain classification on the PIPA dataset.
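The abstract names a relational graph convolutional network but gives no architectural details. As a rough illustration only, the sketch below shows one propagation step of a generic relational GCN layer (self-connection plus a per-relation mean over incoming neighbours, followed by ReLU). The function names, the relation labels "near" and "facing", and all weights and features are hypothetical and are not taken from PRISE.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply matrix w (rows x cols) by vector x (length cols)."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]


def rgcn_layer(
    features: List[Vector],
    edges: Dict[str, List[Tuple[int, int]]],
    rel_weights: Dict[str, Matrix],
    self_weight: Matrix,
) -> List[Vector]:
    """One generic relational GCN step (assumed form, not the paper's).

    edges[r] holds directed (source, target) pairs for relation type r.
    Each target averages the transformed features of its sources per relation.
    """
    # Self-connection term: every node keeps a transformed copy of itself.
    out = [matvec(self_weight, h) for h in features]
    for rel, pairs in edges.items():
        # Count incoming edges of this relation per target (the normaliser).
        in_degree = [0] * len(features)
        for _, dst in pairs:
            in_degree[dst] += 1
        for src, dst in pairs:
            msg = matvec(rel_weights[rel], features[src])
            for k, m in enumerate(msg):
                out[dst][k] += m / in_degree[dst]
    # ReLU non-linearity.
    return [[max(0.0, v) for v in h] for h in out]


# Hypothetical toy input: three persons with 2-d features, two relation types.
identity = [[1.0, 0.0], [0.0, 1.0]]
person_features = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
edges = {"near": [(0, 1), (2, 1)], "facing": [(1, 0)]}
rel_weights = {"near": identity, "facing": [[0.5, 0.0], [0.0, 0.5]]}

updated = rgcn_layer(person_features, edges, rel_weights, identity)
print(updated)  # [[1.0, 1.0], [2.0, 2.5], [3.0, 1.0]]
```

In this toy graph, person 1 receives the mean of persons 0 and 2 under "near", and person 0 receives half of person 1's features under "facing"; person 2 has no incoming edges and keeps its own features.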


Updated: 2021-08-02
