Cross-Modal Pyramid Translation for RGB-D Scene Recognition
International Journal of Computer Vision (IF 11.6), Pub Date: 2021-05-18, DOI: 10.1007/s11263-021-01475-7
Dapeng Du, Limin Wang, Zhaoyang Li, Gangshan Wu

Existing RGB-D scene recognition approaches typically employ two separate, modality-specific networks to learn effective RGB and depth representations respectively. This independent training scheme fails to capture the correlation between the two modalities, and thus may be suboptimal for RGB-D scene recognition. To address this issue, this paper proposes a general and flexible framework that enhances RGB-D representation learning with a customized cross-modal pyramid translation branch, coined TRecgNet. The framework unifies the tasks of cross-modal translation and modality-specific recognition with a shared feature encoder, and aims to leverage the correspondence between the two modalities to regularize the representation learning of each one. Specifically, we present a cross-modal pyramid translation strategy that performs multi-scale image generation under carefully designed layer-wise perceptual supervision. To improve the complementarity of cross-modal translation to modality-specific scene recognition, we devise a feature selection module that adaptively enhances discriminative information during the translation procedure. In addition, we train multiple auxiliary classifiers to further regularize the generated data so that its label predictions remain consistent with those of its paired real data. Meanwhile, the translation branch allows us to generate cross-modal data for training data augmentation and further improve single-modality scene recognition. Extensive experiments on the SUN RGB-D and NYU Depth V2 benchmarks demonstrate the superiority of the proposed method over state-of-the-art RGB-D scene recognition methods. We also generalize TRecgNet to the single-modality scene recognition benchmark MIT Indoor, automatically synthesizing a depth view to boost the final recognition accuracy.
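To make the structure described in the abstract concrete, below is a minimal PyTorch-style sketch: a shared encoder feeds both a scene classifier and a pyramid translation branch, and an auxiliary classifier checks that the translated images still predict the correct scene label. All module names, layer widths, and loss weights are illustrative assumptions, and a simple L1 reconstruction term stands in for the paper's layer-wise perceptual supervision and feature selection module; this is not the authors' actual implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedEncoder(nn.Module):
    """Backbone shared by recognition and translation; returns a feature pyramid."""
    def __init__(self, in_ch=3, widths=(64, 128, 256)):
        super().__init__()
        stages, ch = [], in_ch
        for w in widths:
            stages.append(nn.Sequential(
                nn.Conv2d(ch, w, 3, stride=2, padding=1),
                nn.BatchNorm2d(w), nn.ReLU(inplace=True)))
            ch = w
        self.stages = nn.ModuleList(stages)

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats  # low- to high-level features

class PyramidTranslator(nn.Module):
    """Decodes each pyramid level into the other modality at full resolution,
    so the generation can be supervised layer by layer (multi-scale)."""
    def __init__(self, widths=(64, 128, 256), out_ch=3):
        super().__init__()
        self.heads = nn.ModuleList(nn.Conv2d(w, out_ch, 3, padding=1) for w in widths)

    def forward(self, feats, out_size):
        outs = []
        for head, f in zip(self.heads, feats):
            img = torch.tanh(head(f))
            outs.append(F.interpolate(img, size=out_size,
                                      mode='bilinear', align_corners=False))
        return outs

class TRecgNetSketch(nn.Module):
    def __init__(self, num_classes=19, widths=(64, 128, 256)):
        super().__init__()
        self.encoder = SharedEncoder(widths=widths)
        self.translator = PyramidTranslator(widths=widths)
        self.classifier = nn.Linear(widths[-1], num_classes)
        # Auxiliary path: re-encode the generated modality and predict the label,
        # pushing the generated data to agree with its paired data on prediction.
        self.aux_encoder = SharedEncoder(widths=widths)
        self.aux_classifier = nn.Linear(widths[-1], num_classes)

    def forward(self, x, target_modality, labels):
        feats = self.encoder(x)
        logits = self.classifier(feats[-1].mean(dim=(2, 3)))
        gen_pyramid = self.translator(feats, out_size=target_modality.shape[-2:])

        # 1) modality-specific recognition loss
        cls_loss = F.cross_entropy(logits, labels)
        # 2) layer-wise translation loss (L1 stands in for perceptual supervision)
        trans_loss = sum(F.l1_loss(g, target_modality) for g in gen_pyramid)
        # 3) auxiliary classification loss on the generated modality
        aux_feats = self.aux_encoder(gen_pyramid[-1])
        aux_logits = self.aux_classifier(aux_feats[-1].mean(dim=(2, 3)))
        aux_loss = F.cross_entropy(aux_logits, labels)

        return logits, cls_loss + trans_loss + 0.1 * aux_loss

# Usage: translate RGB to depth while training the RGB recognition branch.
model = TRecgNetSketch(num_classes=19)
rgb = torch.randn(2, 3, 224, 224)
depth = torch.randn(2, 3, 224, 224)   # depth as a 3-channel encoding, an assumption
labels = torch.randint(0, 19, (2,))
logits, loss = model(rgb, depth, labels)
loss.backward()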




Updated: 2021-05-18