Semantic based autoencoder-attention 3D reconstruction network
Introduction
The task of 3D reconstruction from one or several 2D images is a classic problem that can be traced back to Horn [1]. It is a fundamental scientific problem arising in a wide variety of fields, such as virtual reality, autonomous driving, etc. Accordingly, image-based 3D reconstruction has been a focus of computer vision research for many years. Classic multi-view 3D reconstruction methods concentrate on relating camera parameters across views to the reconstructed 3D object model, e.g. volumetric graph cuts [2], compressed sensing [3], structure from motion (SFM) [4] and simultaneous localization and mapping (SLAM) [5].
Moreover, recovering 3D geometric shape from a single-view image is also an important problem with many applications. In high-level image editing, the recovered geometry can be used to change the lighting and material properties in a scene. Furthermore, single-view reconstruction can serve as a baseline for complex modeling.
However, single-view 3D reconstruction is an ill-posed problem due to the lack of disparity information. Traditional methods require additional prior knowledge, such as a specific scene structure [6], [7] or geometric constraints on the 3D structure [8], [9], which limits their applicability.
Owing to the remarkable achievements of deep learning and the establishment of various 3D object databases, learning-based methods have gradually been introduced into 3D reconstruction, e.g. Wu et al. [10], Tatarchenko et al. [11], Yan et al. [12], Choy et al. [13], and Fan et al. [14].
There are three popular ways to represent a 3D model: polygonal meshes, point clouds and voxels. A voxel, short for volume element, is the three-dimensional analogue of a pixel. In contrast to voxels, points and polygons are explicitly represented by the coordinates of their vertices. A direct consequence of this difference is that polygons can efficiently represent simple 3D structures with large empty or homogeneously filled regions, while voxels excel at representing regularly sampled spaces that are non-homogeneously filled. Since neural networks require regular representations for their input and output, polygonal meshes and point clouds are not easy to feed into learning methods. Although the voxel representation is expensive, its matrix form fully satisfies this regularity requirement. Voxels are therefore the most common representation in learning-based 3D reconstruction. A point cloud with a fixed number of points is a possible compromise, but this paper mainly studies methods based on the voxel representation.
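The contrast above can be made concrete with a small sketch (a hypothetical illustration, not the paper's data pipeline): a cube stored as a dense occupancy grid versus the same cube stored as explicit mesh vertices.

```python
import numpy as np

# Hypothetical 32^3 occupancy grid: a dense 3D array whose regular
# shape is exactly what convolutional networks expect as input/output.
voxels = np.zeros((32, 32, 32), dtype=np.float32)
voxels[10:22, 10:22, 10:22] = 1.0  # a solid axis-aligned cube

# The same cube as a polygonal mesh is an irregular, explicit structure:
# just its 8 corner coordinates (faces omitted for brevity).
vertices = np.array(
    [[x, y, z] for x in (10, 22) for y in (10, 22) for z in (10, 22)],
    dtype=np.float32,
)

# Voxel storage grows with resolution cubed regardless of content,
# while the mesh stays at 8 vertices however finely space is sampled.
print(voxels.size)    # 32768 scalars for the grid
print(vertices.size)  # 24 scalars for the mesh vertices
```

The grid wastes storage on empty space, but its fixed shape is what makes it directly usable as a network output tensor.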
The most commonly used 3D reconstruction architecture is the autoencoder, as in [13]. However, when reproducing the experiments of [13], [14], we found that the output models often lack detailed information in the single-view reconstruction task. We attribute this to the ill-posed nature of single-view reconstruction and the one-sidedness of current networks. As part of the missing information can be found in the corresponding image, we propose to address this problem by introducing an attention mechanism into the task.
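The idea of recovering missing detail from the image can be sketched with minimal dot-product attention (an assumption for illustration; the paper's exact formulation is given in Section 3): a query describing an under-reconstructed 3D region weights the image features so that the regions holding the absent detail contribute most.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, keys, values):
    # Minimal dot-product attention sketch: score each image-feature
    # "key" against the query, normalise the scores into weights, and
    # return the weighted combination of the corresponding "values".
    weights = softmax(keys @ query)
    return weights @ values
```

When one key aligns strongly with the query, its value dominates the output, which is exactly the behaviour needed to pull a specific image detail into the 3D prediction.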
Besides, we also find that some generated models lack obvious semantic information. For example, in the generated voxels of some bookcases, as shown in Fig. 1, missing or redundant grids make the output closer to a plain box. We use a semantic comparison module to correct this problem.
The semantic comparison module can also be interpreted through the lens of GANs (Generative Adversarial Networks). The semantics of the generated voxels are compared with those of the original images, so the semantic comparison stage acts as an adversary to the generation stage. Through the interplay of these two stages, the semantic features of the generated models become more pronounced. Unlike a GAN, however, there is no alternating training phase; we merely fine-tune the pre-trained network. In this sense, this stage can be viewed as GAN-style fine-tuning.
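A minimal sketch of such a fine-tuning objective, under the assumption (not specified at this point in the paper) that both the image and the generated voxels are embedded into a shared semantic feature space, is a reconstruction loss plus a penalty on semantic disagreement:

```python
import numpy as np

def cosine_similarity(a, b):
    # similarity between two semantic feature vectors in [-1, 1]
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def semantic_loss(img_feat, vox_feat, recon_loss, weight=0.1):
    # Hypothetical fine-tuning objective: the usual reconstruction loss
    # plus a term that grows as the semantics of the generated voxels
    # drift away from the semantics of the input image. The `weight`
    # hyperparameter is an assumption, not a value from the paper.
    return recon_loss + weight * (1.0 - cosine_similarity(img_feat, vox_feat))
```

When the two embeddings agree perfectly, the extra term vanishes and only the reconstruction loss remains, so the module only "pushes back" when semantics are lost.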
Based on these two ideas, we build an end-to-end semantic-based 3D AE-attention network (SAAN) for single-view 3D reconstruction. The proposed SAAN consists of two parts. The first part, the AE-attention network, has two branches. The upper branch learns to generate the rough 3D shape of an object: we feed the image into a modified 3D variational autoencoder to obtain the general volumetric occupancy. The other branch integrates the details of the 3D object via an attention mechanism, learning to assign higher weights to the image features of missing details; the attention network thus produces a volumetric occupancy that represents the object's details. Finally, we combine the volumetric occupancies of the two branches to obtain the full 3D object model. In the second, semantic stage, we compare the semantic information of the input image and the output object to refine the output. Our architecture generates 3D object models with more vivid details and achieves qualitative and quantitative improvements on the ShapeNet dataset [15] over [13], [14].
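The final combination step can be sketched as follows (a hypothetical element-wise fusion; the paper's actual merging operator may differ): the autoencoder branch supplies a coarse occupancy grid, the attention branch supplies per-voxel detail scores, and their element-wise maximum is thresholded into the final prediction.

```python
import numpy as np

def combine_branches(rough, details, threshold=0.5):
    # Hypothetical fusion of the two AAN branches: take the element-wise
    # maximum of the coarse occupancy and the detail occupancy, then
    # binarise into the final voxel prediction. `threshold` is assumed.
    fused = np.maximum(rough, details)
    return (fused > threshold).astype(np.float32)

# Toy 4^3 example: the coarse branch fills half the grid,
# the attention branch recovers one extra detail voxel.
rough = np.zeros((4, 4, 4)); rough[:2] = 0.9
details = np.zeros((4, 4, 4)); details[2, 0, 0] = 0.8
full = combine_branches(rough, details)
```

Taking the maximum rather than the sum keeps the fused scores interpretable as occupancies in [0, 1], one plausible reason to merge the branches this way.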
The rest of this paper is organized as follows. Section 2 briefly reviews related work on learning-based single-view 3D reconstruction. Section 3 introduces the details of our proposed method. Section 4 describes the experimental setup and discusses the experimental results of our method in comparison with the state of the art. Our conclusions are given in Section 5.
Related work
This section presents a brief overview of existing algorithms for single-view image 3D reconstruction based on deep learning.
SAAN
In this section, we propose an effective network structure to reconstruct authentic 3D object model from one single view image. It decomposes the 2D-to-3D reconstruction task into two parts (as shown in Fig. 2).
The first part, which we call AAN, is the basic 3D reconstruction network, composed of two branches as shown in Fig. 3. One reconstructs a rough 3D object shape conditioned on a single image sampled from an arbitrary view, using a common autoencoder structure, yielding an
Experimental results and analysis
In this section, a set of experiments were performed to test the effectiveness and generalization of our proposed network.
Conclusion
In this paper, we design a 3D reconstruction network SAAN. The proposed method decomposes the prediction into two parts. The first part is made of two parallel branches. The 3DAE branch produces rough 3D shape by a standard AE, and the attention branch establishes the correspondence between missing details in volumetric occupancy and regions in image to add the details for completing 3D model shape. In the other part, we then compare the semantic information between the input images and the
Declaration of Competing Interest
The authors declare that they have no conflict of interest.
Acknowledgments
This work is supported by the National Natural Science Foundation of China under Grant No. 61971383, No. 61631016 and No. 61801441.
References (32)
- B. K. P. Horn, Shape from Shading: A Method for Obtaining the Shape of a Smooth Opaque Object from One View, ...
- et al., Multi-view stereo reconstruction via voxel clustering and optimization of parallel volumetric graph cuts (2011)
- et al., Compressed Sensing for Multi-View Tracking and 3-D Voxel Reconstruction (2008)
- et al., Visual simultaneous localization and mapping: a survey, Artif. Intell. Rev. (2015)
- et al., The structure-from-motion reconstruction pipeline: a survey with focus on short image sequences, Kybernetika (2010)
- et al., Tour into the picture: using a spidery mesh interface to make animation from a single image, Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques (1997)
- et al., Automatic photo pop-up, ACM Transactions on Graphics (TOG) (2005)
- et al., A method for interactive 3D reconstruction of piecewise planar objects from single images, Proc. BMVC (1999)
- et al., Using vanishing points for camera calibration and coarse 3D reconstruction from a single image, Vis. Comput. (2000)
- et al., 3D ShapeNets: a deep representation for volumetric shapes, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015)