Hypersim: A Photorealistic Synthetic Dataset for Holistic Indoor Scene Understanding, arXiv - CS - Graphics

Hypersim: A Photorealistic Synthetic Dataset for Holistic Indoor Scene Understanding
arXiv - CS - Graphics. Pub Date: 2020-11-04, DOI: arxiv-2011.02523
Mike Roberts, Nathan Paczan

For many fundamental scene understanding tasks, it is difficult or impossible to obtain per-pixel ground truth labels from real images. We address this challenge by introducing Hypersim, a photorealistic synthetic dataset for holistic indoor scene understanding. To create our dataset, we leverage a large repository of synthetic scenes created by professional artists, and we generate 77,400 images of 461 indoor scenes with detailed per-pixel labels and corresponding ground truth geometry. Our dataset: (1) relies exclusively on publicly available 3D assets; (2) includes complete scene geometry, material information, and lighting information for every scene; (3) includes dense per-pixel semantic instance segmentations for every image; and (4) factors every image into diffuse reflectance, diffuse illumination, and a non-diffuse residual term that captures view-dependent lighting effects. Together, these features make our dataset well-suited for geometric learning problems that require direct 3D supervision, multi-task learning problems that require reasoning jointly over multiple input and output modalities, and inverse rendering problems. We analyze our dataset at the level of scenes, objects, and pixels, and we analyze costs in terms of money, annotation effort, and computation time. Remarkably, we find that it is possible to generate our entire dataset from scratch, for roughly half the cost of training a state-of-the-art natural language processing model. All the code we used to generate our dataset is available online.
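As a rough illustration of the image factorization in point (4), the sketch below recombines one pixel from its three terms. The abstract does not state how the terms combine, so the relation used here (image = diffuse reflectance × diffuse illumination + non-diffuse residual, in linear color) is an assumption. The function names `recompose_pixel` and `residual_pixel` are hypothetical and are not taken from the authors' code.

```python
# Sketch of the per-pixel factorization mentioned in the abstract.
# ASSUMPTION (not stated in the abstract): the three terms recombine as
#   image = diffuse_reflectance * diffuse_illumination + non_diffuse_residual
# All names below are hypothetical and chosen for illustration only.

def recompose_pixel(reflectance, illumination, residual):
    """Recombine one RGB pixel from its three factors (linear color values)."""
    return tuple(r * i + e for r, i, e in zip(reflectance, illumination, residual))


def residual_pixel(image, reflectance, illumination):
    """Recover the non-diffuse residual of one RGB pixel from the other terms."""
    return tuple(c - r * i for c, r, i in zip(image, reflectance, illumination))


# Made-up example values, chosen to be exactly representable as floats.
reflectance = (0.5, 0.25, 0.125)   # diffuse reflectance (albedo-like term)
illumination = (2.0, 2.0, 2.0)     # diffuse illumination (shading-like term)
residual = (0.0, 0.25, 0.5)        # view-dependent effects, e.g. highlights

pixel = recompose_pixel(reflectance, illumination, residual)
print(pixel)
print(residual_pixel(pixel, reflectance, illumination))
```

Under this assumed relation, any one of the three terms can be recovered from the final image and the other two, which is the property an inverse rendering method would try to exploit.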

Updated: 2020-11-18
