Visual Semantic-Based Representation Learning Using Deep CNNs for Scene Recognition
ACM Transactions on Multimedia Computing, Communications, and Applications (IF 5.2) Pub Date: 2021-05-12, DOI: 10.1145/3436494
Shikha Gupta 1, Krishan Sharma 1, Dileep Aroor Dinesh 1, Veena Thenkanidiyoor 2

In this work, we address the task of scene recognition from image data. A scene is a spatially correlated arrangement of various visual semantic contents, also known as concepts, e.g., "chair," "car," "sky," etc. Representation learning using visual semantic content can be regarded as one of the most natural ideas, as it mimics the human way of perceiving visual information. Semantic multinomial (SMN) representation is one such representation that captures semantic information using posterior probabilities of concepts. The core part of obtaining the SMN representation is building the concept models, which requires ground-truth (true) concept labels for every concept present in an image. However, manual labeling of concepts is practically infeasible due to the large number of images in a dataset. To address this issue, we propose an approach for generating pseudo-concepts in the absence of true concept labels. We utilize pre-trained deep CNN-based architectures, where activation maps (filter responses) from convolutional layers are considered initial cues to the pseudo-concepts. Non-significant activation maps are removed using the proposed filter-specific threshold-based approach, which eliminates non-prominent concepts from the data. Further, we propose a grouping mechanism that merges identical pseudo-concepts using subspace modeling of filter responses to achieve a non-redundant representation. Experimental studies show that the SMN representation generated from pseudo-concepts achieves comparable results for scene recognition on standard datasets such as MIT-67 and SUN-397, even in the absence of true concept labels.
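The first two steps of this pipeline are concrete enough to sketch. The minimal Python/PyTorch sketch below treats the activation maps of one convolutional layer of a pre-trained CNN as initial pseudo-concept cues, drops non-significant maps with a per-filter threshold, and normalizes the surviving responses into an SMN-style probability vector. VGG-16, the layer index, and the mean-energy threshold rule are illustrative assumptions, not the paper's exact choices.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Pre-trained deep CNN; VGG-16 is an illustrative choice, not
# necessarily the architecture used in the paper.
cnn = models.vgg16(pretrained=True).features.eval()

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def smn_from_pseudo_concepts(image_path, layer_index=28, tau=0.5):
    """Activation maps of one conv layer act as pseudo-concept cues;
    maps whose mean response falls below a filter-specific threshold
    are discarded, and the rest are normalized into an SMN-style
    posterior vector. The threshold rule here is an assumption."""
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    for i, layer in enumerate(cnn):          # forward up to chosen layer
        x = layer(x)
        if i == layer_index:                 # 28 = conv5_3 in VGG-16
            break
    maps = x.squeeze(0)                      # (num_filters, H, W)
    energy = maps.flatten(1).mean(dim=1)     # per-filter response strength
    keep = energy > tau * energy.max()       # drop non-significant maps
    # Posterior-like probabilities over the surviving pseudo-concepts.
    smn = torch.softmax(energy[keep], dim=0)
    return smn, keep
```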

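For the grouping step, the abstract only states that identical pseudo-concepts are merged via subspace modeling of filter responses. As a loose stand-in, the sketch below groups filters by clustering their vectorized activation maps with k-means; this is a simplification of the paper's subspace modeling, used only to show where grouping sits in the pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans

def group_pseudo_concepts(maps, n_groups=32, seed=0):
    """Group filters that respond to the same pseudo-concept.
    `maps` is a (num_filters, H, W) array of the kept activation maps.
    K-means on normalized responses is a stand-in for the paper's
    subspace modeling; it only illustrates the grouping mechanism."""
    vecs = maps.reshape(maps.shape[0], -1)
    n_groups = min(n_groups, len(vecs))      # guard small inputs
    # Normalize so grouping reflects response pattern, not magnitude.
    norms = np.linalg.norm(vecs, axis=1, keepdims=True) + 1e-8
    labels = KMeans(n_clusters=n_groups, random_state=seed,
                    n_init=10).fit_predict(vecs / norms)
    # One representative (mean map) per group -> non-redundant set.
    grouped = np.stack([vecs[labels == g].mean(axis=0)
                        for g in range(n_groups)])
    return grouped.reshape(n_groups, *maps.shape[1:]), labels
```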
Updated: 2021-05-12