A Multi-Modal, Discriminative and Spatially Invariant CNN for RGB-D Object Labeling
IEEE Transactions on Pattern Analysis and Machine Intelligence (IF 20.8), Pub Date: 2017-08-30, DOI: 10.1109/tpami.2017.2747134
Umar Asif, Mohammed Bennamoun, Ferdous A. Sohel

While deep convolutional neural networks have shown remarkable success in image classification, inter-class similarity, intra-class variance, the effective combination of multi-modal data, and the spatial variability of object images remain major challenges. To address these problems, this paper proposes a novel framework to learn a discriminative and spatially invariant classification model for object and indoor scene recognition using multi-modal RGB-D imagery. This is achieved through three postulates: 1) spatial invariance, achieved by combining a spatial transformer network with a deep convolutional neural network to learn features that are invariant to spatial translations, rotations, and scale changes; 2) high discriminative capability, achieved by introducing Fisher encoding within the CNN architecture to learn features with small inter-class similarity and large intra-class compactness; and 3) multi-modal hierarchical fusion, achieved through the regularization of semantic segmentation to a multi-modal CNN architecture, where class probabilities are estimated at different hierarchical levels (i.e., image- and pixel-levels) and fused into a Conditional Random Field (CRF)-based inference hypothesis, whose optimization produces consistent class labels in RGB-D images. Extensive experimental evaluations on RGB-D object and scene datasets, and on live video streams acquired from a Kinect, show that our framework produces superior object and scene classification results compared to state-of-the-art methods.
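To make the first postulate concrete, the sketch below shows the general spatial-transformer pattern the abstract refers to: a localization network predicts a 2x3 affine transform, the input is resampled accordingly, and the rectified image is passed to a CNN classifier. This is a minimal PyTorch illustration, not the paper's architecture; the class name STNClassifier, all layer sizes, and the 64x64 RGB input resolution are hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F

class STNClassifier(nn.Module):
    """Illustrative spatial transformer + CNN classifier (hypothetical sizes)."""
    def __init__(self, num_classes=10):
        super().__init__()
        # Localization network: regresses the 6 parameters of a 2x3 affine transform.
        self.localization = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=7), nn.MaxPool2d(2), nn.ReLU(True),
            nn.Conv2d(8, 10, kernel_size=5), nn.MaxPool2d(2), nn.ReLU(True),
        )
        self.fc_loc = nn.Sequential(
            nn.Linear(10 * 12 * 12, 32), nn.ReLU(True),  # 12x12 assumes 64x64 input
            nn.Linear(32, 6),
        )
        # Initialize the transform to the identity so training starts stably.
        self.fc_loc[2].weight.data.zero_()
        self.fc_loc[2].bias.data.copy_(
            torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))
        # Feature extractor + classifier applied to the spatially rectified image.
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(True), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x):
        # Predict an affine transform, then resample the input with it.
        theta = self.fc_loc(self.localization(x).flatten(1)).view(-1, 2, 3)
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        x = F.grid_sample(x, grid, align_corners=False)
        return self.classifier(self.features(x).flatten(1))

model = STNClassifier()
logits = model(torch.randn(2, 3, 64, 64))  # a batch of two 64x64 RGB crops

Because the resampling is differentiable, the localization network is trained end-to-end from the classification loss alone, which is what lets the learned features become invariant to translation, rotation, and scale.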

Updated: 2017-08-30