Pix2Shape: Towards Unsupervised Learning of 3D Scenes from Images Using a View-Based Representation, International Journal of Computer Vision


International Journal of Computer Vision (IF 19.5), Pub Date: 2020-03-20, DOI: 10.1007/s11263-020-01322-1
Sai Rajeswar, Fahim Mannan, Florian Golemo, Jérôme Parent-Lévesque, David Vazquez, Derek Nowrouzezahrai, Aaron Courville

We infer and generate three-dimensional (3D) scene information from a single input image, without supervision. This problem is under-explored, with most prior work relying on supervision from, e.g., 3D ground truth, multiple images of a scene, image silhouettes or key-points. We propose Pix2Shape, an approach to solve this problem with four components: (i) an encoder that infers the latent 3D representation from an image, (ii) a decoder that generates an explicit 2.5D surfel-based reconstruction of a scene from the latent code, (iii) a differentiable renderer that synthesizes a 2D image from the surfel representation, and (iv) a critic network trained to discriminate between images generated by the decoder-renderer and those from a training distribution. Pix2Shape can generate complex 3D scenes that scale with the view-dependent on-screen resolution, unlike representations that capture world-space resolution, i.e., voxels or meshes. We show that Pix2Shape learns a consistent scene representation in its encoded latent space, and that the decoder can then be applied to this latent representation in order to synthesize the scene from a novel viewpoint. We evaluate Pix2Shape with experiments on the ShapeNet dataset as well as on a novel benchmark we developed, called 3D-IQTT, to evaluate models based on their ability to enable 3D spatial reasoning. Qualitative and quantitative evaluations demonstrate Pix2Shape's ability to solve scene reconstruction, generation and understanding tasks.
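
The four components listed in the abstract can be pictured as a simple data flow. The sketch below is only an illustration of that flow and is not the authors' implementation: the networks are replaced by random, untrained linear maps, the renderer is a toy Lambertian shader, and every name, size and parameter (`encode`, `decode`, `render`, `critic`, `H`, `W`, `LATENT`, the view and light vectors) is an assumption made for this example.

```python
import numpy as np

H = W = 8      # toy image resolution (assumption, not from the paper)
LATENT = 16    # toy latent size (assumption)
rng = np.random.default_rng(0)

# Random, untrained weights stand in for the learned networks.
W_enc = rng.normal(0.0, 0.1, (LATENT, H * W))
W_dec = rng.normal(0.0, 0.1, (H * W * 4, LATENT + 3))
w_critic = rng.normal(0.0, 0.1, H * W)

def encode(image):
    """(i) Encoder: image -> latent scene code."""
    return np.tanh(W_enc @ image.ravel())

def decode(z, view):
    """(ii) Decoder: latent code + view vector -> one surfel (depth, normal) per pixel."""
    out = (W_dec @ np.concatenate([z, view])).reshape(H, W, 4)
    depth = np.exp(out[..., 0])                          # keep depth positive
    normal = out[..., 1:] + np.array([0.0, 0.0, 1.0])    # bias normals toward the camera
    normal = normal / (np.linalg.norm(normal, axis=-1, keepdims=True) + 1e-8)
    return depth, normal

def render(depth, normal, light_dir):
    """(iii) Toy renderer: Lambertian shading with a made-up depth falloff."""
    light = light_dir / np.linalg.norm(light_dir)
    shade = np.clip(normal @ light, 0.0, None)           # cosine term, clamped at zero
    return shade / (1.0 + depth) ** 2

def critic(image):
    """(iv) Critic: image -> scalar realism score (here a linear map)."""
    return float(w_critic @ image.ravel())

image = rng.random((H, W))                               # stand-in for an input photo
z = encode(image)
light = np.array([0.0, 1.0, 1.0])
recon = render(*decode(z, np.array([0.0, 0.0, 1.0])), light)   # same-view reconstruction
novel = render(*decode(z, np.array([0.5, 0.0, 1.0])), light)   # same code, new view
score = critic(recon)
```

Because the surfels are produced per pixel, the size of the decoder output follows the image resolution rather than a fixed world-space grid, which is the scaling property the abstract contrasts with voxels and meshes. In an adversarial training setup, the critic score on rendered images would drive updates to the encoder and decoder; that training loop is omitted here.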

Updated: 2020-03-20