Learning Asynchronous and Sparse Human-Object Interaction in Videos, arXiv - CS - Computer Vision and Pattern Recognition



Learning Asynchronous and Sparse Human-Object Interaction in Videos
arXiv - CS - Computer Vision and Pattern Recognition, Pub Date: 2021-03-03, DOI: arxiv-2103.02758
Romero Morais, Vuong Le, Svetha Venkatesh, Truyen Tran

Human activities can be learned from video. With effective modeling, it is possible to discover not only the action labels but also the temporal structure of the activities, such as the progression of sub-activities. Automatically recognizing such structure from the raw video signal is a new capability that promises authentic modeling and successful recognition of human-object interactions. Toward this goal, we introduce Asynchronous-Sparse Interaction Graph Networks (ASSIGN), a recurrent graph network that automatically detects the structure of interaction events associated with entities in a video scene. ASSIGN pioneers learning of the autonomous behavior of video entities, including their dynamic structure and their interactions with coexisting neighbors. The lives of entities in our model are asynchronous to those of others and are therefore more flexible in adapting to complex scenarios. Their interactions are sparse in time and hence more faithful to the true underlying nature and more robust in inference and learning. ASSIGN is tested on human-object interaction recognition and shows superior performance in segmenting and labeling human sub-activities and object affordances from raw videos. The model's native ability to discover temporal structure also eliminates the dependence on external segmentation that was previously mandatory for this task.
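To make the terms "asynchronous" and "sparse" concrete, here is a toy sketch in Python. It is not the ASSIGN architecture, which the abstract describes only as a recurrent graph network. The function name, the scalar per-frame features, the threshold gate, and the averaging rule are all assumptions made purely for illustration. Each entity decides on its own when to refresh its state (asynchronous), and it reads its neighbors only at those moments (sparse in time).

```python
def toy_async_sparse_rollout(features, threshold=0.5):
    """Toy illustration only; every rule here is an assumption, not ASSIGN.

    features[t][i] is a scalar observation of entity i at frame t.
    Returns (states, update_frames), where update_frames[i] lists the
    frames at which entity i refreshed its state.
    """
    num_entities = len(features[0])
    states = [0.0] * num_entities      # current state of each entity
    last_seen = [0.0] * num_entities   # observation at each entity's last update
    update_frames = [[] for _ in range(num_entities)]

    for t, frame in enumerate(features):
        # Snapshot so every entity reads neighbor states from before this frame.
        previous = list(states)
        for i, x in enumerate(frame):
            # Hypothetical gate: an entity updates only when its own
            # observation changed enough since its last update.
            if abs(x - last_seen[i]) <= threshold:
                continue  # no event for this entity at this frame
            # Interaction happens only at update time: average neighbor states.
            neighbors = [previous[j] for j in range(num_entities) if j != i]
            message = sum(neighbors) / len(neighbors) if neighbors else 0.0
            states[i] = 0.5 * x + 0.5 * message
            last_seen[i] = x
            update_frames[i].append(t)
    return states, update_frames


if __name__ == "__main__":
    video = [[0.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 2.0]]
    states, updates = toy_async_sparse_rollout(video)
    print(states, updates)
```

In this toy run the two entities update at different frames and stay idle in between, which is the behavior the abstract contrasts with updating every entity at every frame.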


Updated: 2021-03-05
