Structured Label Inference for Visual Understanding.,IEEE Transactions on Pattern Analysis and Machine Intelligence

当前位置： X-MOL 学术 › IEEE Trans. Pattern Anal. Mach. Intell. › 论文详情

Our official English website, www.x-mol.net, welcomes your feedback! (Note: you will need to create a separate account there.)

Structured Label Inference for Visual Understanding.
IEEE Transactions on Pattern Analysis and Machine Intelligence ( IF 23.6 ) Pub Date : 2019-01-16 , DOI: 10.1109/tpami.2019.2893215
Nelson Isao Nauata Junior , Hexiang Hu , Guang-Tong Zhou , Zhiwei Deng , Zicheng Liao , Greg Mori

Visual data such as images and videos contain a rich source of structured semantic labels as well as a wide range of interacting components. Visual content could be assigned with fine-grained labels describing major components, coarse-grained labels depicting high level abstractions, or a set of labels revealing attributes. Such categorization over different, interacting layers of labels evinces the potential for a graph-based encoding of label information. In this paper, we exploit this rich structure for performing graph-based inference in label space for a number of tasks: multi-label image and video classification and action detection in untrimmed videos. We consider the use of the Bidirectional Inference Neural Network (BINN) and Structured Inference Neural Network (SINN) for performing graph-based inference in label space and propose a Long Short-Term Memory (LSTM) based extension for exploiting activity progression on untrimmed videos. The methods were evaluated on (i) the Animal with Attributes (AwA), Scene Understanding (SUN) and NUS-WIDE datasets for multi-label image classification, (ii) the first two releases of the YouTube-8M large scale dataset for multi-label video classification, and (iii) the THUMOS'14 and MultiTHUMOS video datasets for action detection. Our results demonstrate the effectiveness of structured label inference in these challenging tasks, achieving significant improvements against baselines.

中文翻译：

用于视觉理解的结构化标签推断。

图像和视频之类的视觉数据包含丰富的结构化语义标签以及广泛的交互组件。视觉内容可以分配有描述主要成分的细粒度标签，描述高级抽象的粗粒度标签或一组揭示属性的标签。标签在不同的，相互作用的层上的这种分类表明标签信息基于图的编码的潜力。在本文中，我们利用这种丰富的结构在标签空间中执行基于图的推理，以完成许多任务：多标签图像和视频分类以及未修剪视频中的动作检测。我们考虑使用双向推理神经网络（BINN）和结构化推理神经网络（SINN）在标签空间中执行基于图的推理，并提出基于长短期记忆（LSTM）的扩展，以利用未修饰视频上的活动进度。在（i）具有属性的动物（AwA），场景理解（SUN）和NUS-WIDE数据集上对这些方法进行了评估，以进行多标签图像分类；（ii）前两个YouTube-8M大规模数据集用于多标签图像分类标签视频分类，以及（iii）用于动作检测的THUMOS'14和MultiTHUMOS视频数据集。我们的结果证明了结构化标签推理在这些挑战性任务中的有效性，相对于基线而言取得了显着改善。

更新日期：2020-04-22

点击分享查看原文

点击收藏

公开下载

阅读更多本刊最新论文本刊介绍/投稿指南

全部期刊列表>>