Graphical Models

Volume 106, November 2019, 101050

Semantic based autoencoder-attention 3D reconstruction network

https://doi.org/10.1016/j.gmod.2019.101050

Abstract

3D object reconstruction from a single-view image is a challenging task. Because the information contained in one isolated image is insufficient for reasonable 3D shape recovery, existing single-view 3D reconstruction methods often lack marginal voxels and miss obvious semantic information. To tackle these problems, we propose the Semantic Autoencoder-Attention Network (SAAN) for single-view 3D reconstruction. Distinct from the common autoencoder (AE) structure, the proposed network consists of two successive parts. The first part is made of two parallel branches, a 3D autoencoder (3DAE) and an Attention Network: 3DAE reconstructs the general shape with an AE model, and the Attention Network supplements the missing details with a 3D reconstruction attention network. In the second part, we compare the semantic information of the input images and the voxels generated by the autoencoder-attention network (AAN), using a Res101 network and a 3D-Res101 network respectively. The results of the comparison are fed back to the AAN for iterative parameter adjustment until it generates the best semantically characterized voxel representation. In the experiments, we verify the feasibility of our network on the ShapeNet dataset. Compared with state-of-the-art methods, the proposed SAAN produces more precise 3D object models in terms of both qualitative and quantitative evaluation.

Introduction

The task of 3D reconstruction from one or several 2D images is a classic problem that can be traced back to Horn et al. [1]. It is a general scientific problem arising in a wide variety of fields, such as virtual reality and autonomous driving. Accordingly, image-based 3D reconstruction has been a focus of computer vision research for many years. Classic multi-view 3D reconstruction methods concentrate on relating the various camera perspective parameters to the reconstructed 3D object model, e.g. volumetric graph cuts [2], compressed sensing [3], structure from motion (SFM) [4] and simultaneous localization and mapping (SLAM) [5].

Moreover, recovering 3D geometric shape from a single-view image is also an important problem with many applications. In high-level image editing, the recovered geometry can be used to change the lighting and material properties in the scene. Furthermore, single-view reconstruction can serve as a baseline for more complex modeling.

However, single-view 3D reconstruction is an ill-posed problem due to the lack of disparity information. Traditional methods require additional prior knowledge, such as a specific scene structure [6], [7] or geometrical constraints on the 3D structure [8], [9], which limits their applicability.

Owing to the remarkable achievements of learning methods and the establishment of various 3D object databases, learning methods have gradually been introduced into 3D reconstruction tasks, e.g. Wu et al. [10], Tatarchenko et al. [11], Yan et al. [12], Choy et al. [13] and Fan et al. [14].

There are three popular ways to represent a 3D model: polygonal meshes, point clouds and voxels. A voxel, short for volume element, is the three-dimensional analogue of a pixel. In contrast to voxels, points and polygons are explicitly represented by the coordinates of their vertices. A direct consequence of this difference is that polygons can efficiently represent simple 3D structures with large empty or homogeneously filled regions, while voxels excel at representing regularly sampled spaces that are non-homogeneously filled. Since neural networks require regularly structured inputs and outputs, polygonal meshes and point sets are not easy to feed into learning methods. Although the voxel representation has high complexity, its matrix nature fully satisfies this regularity requirement, so voxels are the most common representation in learning-based 3D reconstruction. A point cloud with a fixed number of points is also a workable compromise. This paper mainly studies methods based on voxel representation.
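To make the contrast concrete, the following minimal NumPy sketch shows a voxel model as a fixed-shape occupancy tensor next to the variable-size point list recovered from it; the 32³ resolution and the example block are our own illustrative choices, not values from the paper.

```python
import numpy as np

# A voxel model is a dense, regularly sampled occupancy grid:
# a rank-3 tensor in which each cell stores occupied (1) or empty (0).
# The fixed shape is exactly the regularity a conv net requires.
resolution = 32  # assumed resolution; the paper's grid size may differ
occupancy = np.zeros((resolution, resolution, resolution), dtype=np.float32)

# Mark a small solid block as occupied (e.g. the seat of a chair).
occupancy[10:22, 10:22, 14:18] = 1.0

# A point cloud, by contrast, is an unordered N x 3 list of coordinates;
# N varies per object, which complicates standard convolutional layers.
points = np.argwhere(occupancy > 0.5).astype(np.float32)  # shape (N, 3)
print(occupancy.shape, points.shape)  # (32, 32, 32) (576, 3)
```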

The most commonly used 3D reconstruction architecture is the autoencoder, e.g. [13]. However, when we reproduced the experiments of [13], [14], we found that the output models often lack detailed information in the single-view reconstruction task. We attribute this problem to the ill-posed nature of single-view reconstruction and the one-sidedness of the current networks. As part of the missing information can be found in the corresponding image, we propose to address this problem by introducing an attention mechanism into the task.
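For readers unfamiliar with this baseline, the PyTorch sketch below shows the general image-to-voxel encoder-decoder pattern that [13] and our 3DAE branch build on; all layer sizes and the 64×64 input / 32³ output resolutions are illustrative assumptions, not the actual architecture of [13] or of our network.

```python
import torch
import torch.nn as nn

class Simple3DAE(nn.Module):
    """Minimal image-to-voxel encoder-decoder sketch.
    Layer sizes are illustrative assumptions, not the paper's 3DAE."""
    def __init__(self, latent_dim=128):
        super().__init__()
        # 2D encoder: 64x64 RGB image -> latent vector
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),    # 32x32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),   # 16x16
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),  # 8x8
            nn.Flatten(),
            nn.Linear(128 * 8 * 8, latent_dim),
        )
        # 3D decoder: latent vector -> 32^3 occupancy grid
        self.fc = nn.Linear(latent_dim, 128 * 4 * 4 * 4)
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(128, 64, 4, stride=2, padding=1), nn.ReLU(),  # 8^3
            nn.ConvTranspose3d(64, 32, 4, stride=2, padding=1), nn.ReLU(),   # 16^3
            nn.ConvTranspose3d(32, 1, 4, stride=2, padding=1),               # 32^3
        )

    def forward(self, image):
        z = self.encoder(image)
        volume = self.fc(z).view(-1, 128, 4, 4, 4)
        return torch.sigmoid(self.decoder(volume))  # per-voxel occupancy probability

model = Simple3DAE()
voxels = model(torch.randn(1, 3, 64, 64))
print(voxels.shape)  # torch.Size([1, 1, 32, 32, 32])
```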

Besides, we also find that some of the generated models miss obvious semantic information. For example, in the generated voxels of some bookcases shown in Fig. 1, missing or redundant grids make the voxels look closer to a plain box. We use a semantic comparison module to correct this problem.

Of course, the semantic comparison module can also be explained from the viewpoint of GANs (Generative Adversarial Networks). The generated voxels are compared with the semantics of the original images, so the semantic comparison stage plays against the generation stage. Through the adversarial interplay of these two stages, the semantic features of the generated models become more pronounced. However, unlike a GAN, we have no cross-training phase; we merely fine-tune the network on top of a pre-trained network. In this sense, this stage can be regarded as GAN-style fine-tuning.
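A hedged PyTorch sketch of this semantic comparison idea follows: embed the input image and the generated voxels into a shared feature space and penalize their disagreement. The embedding size, the projection heads and the small stand-in 3D feature extractor below are hypothetical; the paper itself uses a Res101 network for images and a 3D-Res101 network for voxels.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet101

# 2D image branch: a ResNet-101 with its classifier head replaced by a
# projection into an assumed 256-d embedding space. In practice one
# would load ImageNet-pretrained weights before fine-tuning.
image_net = resnet101()
image_net.fc = torch.nn.Linear(2048, 256)

# Voxel branch: a hypothetical small 3D conv stand-in for 3D-Res101.
voxel_net = torch.nn.Sequential(
    torch.nn.Conv3d(1, 32, 4, stride=2, padding=1), torch.nn.ReLU(),
    torch.nn.Conv3d(32, 64, 4, stride=2, padding=1), torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool3d(1), torch.nn.Flatten(),
    torch.nn.Linear(64, 256),
)

def semantic_loss(image, voxels):
    """1 - cosine similarity between the two embeddings: low when the
    generated voxels carry the same semantics as the input image."""
    f_img = image_net(image)   # (B, 256)
    f_vox = voxel_net(voxels)  # (B, 256)
    return (1.0 - F.cosine_similarity(f_img, f_vox)).mean()
```

Feeding the gradient of this loss back into the generator is what makes the comparison stage adversarial in spirit while still being plain fine-tuning.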

Based on these two ideas, we build an end-to-end semantic-based 3D AE-attention network (SAAN) for the single-view 3D reconstruction task. The proposed SAAN consists of two parts. The first part, the AE-attention network, has two branches. The upper branch learns to generate a rough 3D shape of the object: we feed the input image into a modified 3D variational autoencoder reconstruction architecture to obtain the general volumetric occupancy. The other branch integrates the details of the 3D object by an attention mechanism: it learns to assign higher weights to the image features of the missing details, so the Attention Network outputs a volumetric occupancy that represents the details of the object. Finally, we merge the volumetric occupancies of the two branches to obtain the full 3D object model. In the second, semantic stage, we compare the semantic information of the input image and the output object to refine the output. Our architecture generates 3D object models that contain more vivid details and achieves qualitative and quantitative improvements on the ShapeNet dataset [15] compared with [13], [14].
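As a minimal sketch of the final fusion step, assume the two branches output per-voxel occupancy probabilities on the same grid. The excerpt only states that the two volumetric occupancies are put together; the per-voxel maximum below is our assumption, not the paper's stated rule.

```python
import torch

def fuse_branches(rough, detail):
    """Combine the coarse 3DAE prediction with the attention branch's
    detail prediction into one occupancy grid. Taking the per-voxel
    maximum keeps every structure either branch is confident about."""
    return torch.maximum(rough, detail)

rough = torch.rand(1, 1, 32, 32, 32)   # general shape from 3DAE
detail = torch.rand(1, 1, 32, 32, 32)  # recovered details from attention net
full = fuse_branches(rough, detail)    # full 3D shape, same grid
```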

The rest of this paper is organized as follows. Section 2 briefly reviews related work on learning-based single-view 3D reconstruction. Section 3 introduces the details of our proposed method. Section 4 describes the experimental setup and discusses the experimental results of our proposed method in comparison with the state of the art. Our conclusions are given in Section 5.


Related work

This section presents a brief overview of the existing algorithms for single-view 3D reconstruction based on deep learning.

SAAN

In this section, we propose an effective network structure to reconstruct an authentic 3D object model from a single-view image. It decomposes the 2D-to-3D reconstruction task into two parts (as shown in Fig. 2).

The first part, which we call AAN, is the basic 3D reconstruction network, completed by two branches as shown in Fig. 3. One branch reconstructs a rough 3D object shape conditioned on a single image sampled from an arbitrary view, using a common autoencoder structure, yielding an initial general volumetric occupancy.

Experimental results and analysis

In this section, a set of experiments was performed to test the effectiveness and generalization of our proposed network.

Conclusion

In this paper, we design a 3D reconstruction network, SAAN. The proposed method decomposes the prediction into two parts. The first part is made of two parallel branches: the 3DAE branch produces a rough 3D shape with a standard AE, and the attention branch establishes the correspondence between the details missing from the volumetric occupancy and regions of the image, adding those details to complete the 3D model shape. In the other part, we compare the semantic information between the input images and the generated voxels, feeding the comparison results back to refine the final reconstruction.

Declaration of Competing Interest

The authors declare that they have no conflict of interest.

Acknowledgments

This work is supported by the National Natural Science Foundation of China under Grant No. 61971383, No. 61631016 and No. 61801441.

References (32)

  • B.K.P. Horn, Shape from Shading: A Method for Obtaining the Shape of a Smooth Opaque Object from One View, ...
  • Y. Zhu et al., Multi-view stereo reconstruction via voxel clustering and optimization of parallel volumetric graph cuts (2011).
  • D. Reddy et al., Compressed sensing for multi-view tracking and 3-D voxel reconstruction (2008).
  • J. Fuentes-Pacheco et al., Visual simultaneous localization and mapping: a survey, Artif. Intell. Rev. (2015).
  • K. Häming et al., The structure-from-motion reconstruction pipeline – a survey with focus on short image sequences, Kybernetika (2010).
  • Y. Horry et al., Tour into the picture: using a spidery mesh interface to make animation from a single image, Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques (1997).
  • D. Hoiem et al., Automatic photo pop-up, ACM Transactions on Graphics (TOG) (2005).
  • P.F. Sturm et al., A method for interactive 3D reconstruction of piecewise planar objects from single images, Proc. BMVC (1999).
  • E. Guillou et al., Using vanishing points for camera calibration and coarse 3D reconstruction from a single image, Vis. Comput. (2000).
  • Z. Wu et al., 3D ShapeNets: a deep representation for volumetric shapes, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015).
  • M. Tatarchenko et al., Multi-view 3D models from single images with a convolutional network, European Conference on Computer Vision (2016).
  • X. Yan et al., Perspective transformer nets: learning single-view 3D object reconstruction without 3D supervision, Advances in Neural Information Processing Systems (2016).
  • C.B. Choy et al., 3D-R2N2: a unified approach for single and multi-view 3D object reconstruction, European Conference on Computer Vision (2016).
  • H. Fan et al., A point set generation network for 3D object reconstruction from a single image, Conference on Computer Vision and Pattern Recognition (CVPR) (2017).
  • L. Yi, H. Su, L. Shao, M. Savva, H. Huang, Y. Zhou, B. Graham, M. Engelcke, R. Klokov, V. Lempitsky, et al., ...
  • R. Girdhar et al., Learning a predictable and generative vector representation for objects, European Conference on Computer Vision (2016).