Object Priors for Classifying and Localizing Unseen Actions
International Journal of Computer Vision ( IF 19.5 ) Pub Date : 2021-04-19 , DOI: 10.1007/s11263-021-01454-y
Pascal Mettes , William Thong , Cees G. M. Snoek

This work strives for the classification and localization of human actions in videos, without the need for any labeled video training examples. Where existing work relies on transferring global attribute or object information from seen to unseen action videos, we seek to classify and spatio-temporally localize unseen actions in videos from image-based object information only. We propose three spatial object priors, which encode local person and object detectors along with their spatial relations. On top we introduce three semantic object priors, which extend semantic matching through word embeddings with three simple functions that tackle semantic ambiguity, object discrimination, and object naming. A video embedding combines the spatial and semantic object priors. It enables us to introduce a new video retrieval task that retrieves action tubes in video collections based on user-specified objects, spatial relations, and object size. Experimental evaluation on five action datasets shows the importance of spatial and semantic object priors for unseen actions. We find that persons and objects have preferred spatial relations that benefit unseen action localization, while using multiple languages and simple object filtering directly improves semantic matching, leading to state-of-the-art results for both unseen action classification and localization.
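The semantic matching the abstract describes — scoring an unseen action against objects detected in a video via word embeddings — can be sketched minimally. This is an illustrative toy, not the paper's implementation: the embedding vectors below are made-up 3-d values, whereas the actual method uses pretrained word embeddings over real detector vocabularies.

```python
import math

# Hypothetical toy embeddings for illustration only; the paper relies on
# pretrained word embeddings, not these hand-picked 3-d vectors.
EMBED = {
    "basketball": [0.9, 0.1, 0.2],
    "dunk":       [0.8, 0.2, 0.1],
    "violin":     [0.1, 0.9, 0.3],
    "bow":        [0.2, 0.8, 0.4],
}

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def score_action(action_word, detected_objects):
    """Score an unseen action name by its best embedding match
    among object labels detected in the video frames."""
    return max(cosine(EMBED[action_word], EMBED[o]) for o in detected_objects)

# A video whose frames contain a basketball matches "dunk"
# better than the violin-related action "bow".
detected = ["basketball"]
print(score_action("dunk", detected) > score_action("bow", detected))  # True
```

Classification then reduces to ranking all unseen action names by this score; the paper's semantic object priors refine exactly this matching step (ambiguity, discrimination, naming), which the toy above omits.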




Updated: 2021-04-19