Human-Centric Relation Segmentation: Dataset and Solution
IEEE Transactions on Pattern Analysis and Machine Intelligence (IF 23.6), Pub Date: 2021-04-27, DOI: 10.1109/tpami.2021.3075846
Si Liu, Zitian Wang, Yulu Gao, Lejian Ren, Yue Liao, Guanghui Ren, Bo Li, Shuicheng Yan

Vision and language understanding techniques have achieved remarkable progress, but it remains difficult to handle problems involving very fine-grained details. For example, when a robot is told to “bring me the book in the girl’s left hand”, most existing methods would fail if the girl holds a book in each hand. In this work, we introduce a new task named human-centric relation segmentation (HRS), a fine-grained case of HOI-det. HRS aims to predict the relations between the human and surrounding entities and to identify the relation-correlated human parts, which are represented as pixel-level masks. For the above example, our HRS task produces results in the form of relation triplets $\langle$girl [left hand], hold, book$\rangle$ and exact segmentation masks of the book, with which the robot can easily accomplish the grabbing task. Correspondingly, we collect a new Person In Context (PIC) dataset for this task. It contains 17,122 high-resolution images with densely annotated entity segmentation and relations, covering 141 object categories, 23 relation categories and 25 semantic human parts. We also propose a Simultaneous Matching and Segmentation (SMS) framework as a solution to the HRS task. It contains three parallel branches for entity segmentation, subject-object matching and human parsing, respectively. Specifically, the entity segmentation branch obtains entity masks by dynamically generated conditional convolutions; the subject-object matching branch detects the existence of any relations, links the corresponding subjects and objects by displacement estimation and classifies the interacted human parts; and the human parsing branch generates the pixelwise human part labels. Outputs of the three branches are fused to produce the final HRS results. Extensive experiments on the PIC and V-COCO datasets show that the proposed SMS method outperforms baselines while running at an inference speed of 36 FPS. Notably, SMS outperforms the best-performing baseline, $m$-KERN, with only 17.6 percent of its time cost. The dataset and code will be released at http://picdataset.com/challenge/index/.
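
To make the idea of linking subjects and objects by displacement estimation more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each predicted relation carries a 2D offset from the subject center toward the object center, and that the offset target is matched to the nearest segmented entity center within a distance threshold. The names `Candidate`, `Triplet`, `nearest_object`, `link` and `max_dist`, as well as all coordinates, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from math import hypot
from typing import List, Optional, Tuple

Point = Tuple[float, float]


@dataclass
class Candidate:
    """One predicted relation anchored at a subject (illustrative fields)."""
    subject: str
    subject_center: Point
    displacement: Point  # assumed offset from subject center to object center
    human_part: str
    relation: str


@dataclass
class Triplet:
    """An HRS-style result: <subject [human part], relation, object>."""
    subject: str
    human_part: str
    relation: str
    object_index: int  # index into the list of segmented entities


def nearest_object(target: Point, object_centers: List[Point],
                   max_dist: float) -> Optional[int]:
    """Index of the object center closest to target, or None if all are farther than max_dist."""
    best, best_dist = None, max_dist
    for i, (ox, oy) in enumerate(object_centers):
        dist = hypot(ox - target[0], oy - target[1])
        if dist <= best_dist:
            best, best_dist = i, dist
    return best


def link(candidates: List[Candidate], object_centers: List[Point],
         max_dist: float = 20.0) -> List[Triplet]:
    """Turn relation candidates into triplets by following each displacement."""
    triplets = []
    for c in candidates:
        target = (c.subject_center[0] + c.displacement[0],
                  c.subject_center[1] + c.displacement[1])
        idx = nearest_object(target, object_centers, max_dist)
        if idx is not None:  # drop candidates that point at no entity
            triplets.append(Triplet(c.subject, c.human_part, c.relation, idx))
    return triplets


# Hypothetical scene: one girl holding a book in each hand.
books = [(60.0, 150.0), (140.0, 150.0)]
girl = (100.0, 120.0)
candidates = [
    Candidate("girl", girl, (-38.0, 32.0), "left hand", "hold"),
    Candidate("girl", girl, (41.0, 29.0), "right hand", "hold"),
]
triplets = link(candidates, books)
for t in triplets:
    print(f"<{t.subject} [{t.human_part}], {t.relation}, book #{t.object_index}>")
```

In this toy scene, the two candidates resolve to different books, so the instruction "the book in the girl's left hand" maps to a single entity whose mask could then be looked up. A real system would predict the displacements and part labels from image features rather than take them as given.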

Updated: 2021-04-27