Hierarchical Scene Parsing by Weakly Supervised Learning with Image Descriptions
IEEE Transactions on Pattern Analysis and Machine Intelligence (IF 20.8) | Pub Date: 2018-01-30 | DOI: 10.1109/tpami.2018.2799846
Ruimao Zhang, Liang Lin, Guangrun Wang, Meng Wang, Wangmeng Zuo

This paper investigates a fundamental problem of scene understanding: how to parse a scene image into a structured configuration, i.e., a semantic object hierarchy with object interaction relations. We propose a deep architecture consisting of two networks: i) a convolutional neural network (CNN) that extracts the image representation for pixel-wise object labeling, and ii) a recursive neural network (RsNN) that discovers the hierarchical object structure and the inter-object relations. Rather than relying on elaborate annotations (e.g., manually labeled semantic maps and relations), we train our deep model in a weakly supervised manner by leveraging the descriptive sentences of the training images. Specifically, we decompose each sentence into a semantic tree consisting of nouns and verb phrases, and apply these tree structures to discover the configurations of the training images. Once these scene configurations are determined, the parameters of both the CNN and the RsNN are updated accordingly by backpropagation. The entire model is trained with an Expectation-Maximization method. Extensive experiments show that our model produces meaningful scene configurations and achieves more favorable scene labeling results on two benchmarks (i.e., PASCAL VOC 2012 and SYSU-Scenes) than other state-of-the-art weakly supervised deep learning methods. In particular, SYSU-Scenes, which we created to advance research on scene parsing, contains more than 5,000 scene images with semantic sentence descriptions.
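To make the abstract's two key ideas concrete, here is a minimal Python sketch (not the authors' released code): a descriptive sentence decomposed into a semantic tree of nouns and verb phrases, and an EM-style loop that alternates between inferring a latent scene configuration (E-step) and updating parameters toward it (M-step). All names (`SemanticNode`, `e_step`, `m_step`) are hypothetical, and the toy score dictionary stands in for the actual CNN and RsNN.

```python
# Illustrative sketch of weakly supervised scene parsing with sentence trees.
from dataclasses import dataclass, field
from typing import List

@dataclass
class SemanticNode:
    """A node in the semantic tree: a noun (object) or a verb phrase (relation)."""
    label: str                                   # e.g. "person", "ride"
    kind: str                                    # "noun" or "verb-phrase"
    children: List["SemanticNode"] = field(default_factory=list)

    def show(self, depth: int = 0) -> None:
        print("  " * depth + f"{self.kind}: {self.label}")
        for child in self.children:
            child.show(depth + 1)

# "A man rides a bicycle." -> a verb-phrase node linking two noun nodes.
tree = SemanticNode("ride", "verb-phrase", [
    SemanticNode("person", "noun"),
    SemanticNode("bicycle", "noun"),
])
tree.show()

def e_step(pixel_scores: dict, tree: SemanticNode) -> List[str]:
    # Infer the latent configuration: keep the objects the tree mentions,
    # ordered by current model confidence (a toy stand-in for the RsNN's
    # structure search described in the paper).
    nouns = [n.label for n in tree.children if n.kind == "noun"]
    return sorted(nouns, key=lambda n: -pixel_scores.get(n, 0.0))

def m_step(pixel_scores: dict, config: List[str], lr: float = 0.1) -> dict:
    # Treat the inferred configuration as the target and nudge scores of its
    # objects upward (a stand-in for backpropagation through CNN and RsNN).
    return {k: v + lr * (k in config) for k, v in pixel_scores.items()}

scores = {"person": 0.2, "bicycle": 0.1, "sofa": 0.4}   # toy initial scores
for _ in range(3):
    config = e_step(scores, tree)      # E-step: infer latent configuration
    scores = m_step(scores, config)    # M-step: update parameters toward it
print(config, scores)
```

In the paper itself, the E-step searches over hierarchical configurations with the RsNN and the M-step backpropagates through both networks; this sketch only preserves the alternating structure of that training procedure.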

Last updated: 2024-08-22