Monocular image depth prediction without depth sensors: An unsupervised learning method
Introduction
Depth prediction is an active research area in computer and robot vision. In many application scenarios, such as 3D reconstruction [1], [2], scene recognition [3], [4], semantic segmentation [5], [6], [7], augmented reality [8], and assisted driving and automatic guidance [9], [10], [11], depth information is a crucial cue for understanding the geometric relationships of a scene. In this paper, we focus specifically on the depth estimation problem under monocular imaging conditions, without depth sensors or ground truth depth data.
Depth information is lost when a 3D scene is projected onto a 2D image plane, so a single image captured by a camera can correspond to many possible real-world scenes. Depth sensors, such as the Microsoft Kinect and LIDAR, are favored by researchers because they acquire depth information quickly and effectively. However, these sensors are cost-prohibitive, sensitive to strong sunlight, and often burdensome in practical applications [12]. Stereo vision approaches, attractive options over the past few years, rely on two or more rectified images to obtain a depth map by indirect calculation [13], [14], [15], [16]. Although stereo vision is cheap, lightweight, and better suited to both indoor and outdoor scenes than depth sensors, it has its own disadvantages: careful rectification is required to ensure image quality, computing an accurate disparity map takes considerable time, and predictions become ambiguous on texture-less or confusingly textured surfaces. Feature extraction methods rely on typical monocular cues, such as texture changes, occlusions, and object sizes, and typically formulate depth estimation as a Markov random field (MRF) learning problem [1], [17]. However, most monocular cues are "context information": global attributes of an image that cannot be inferred from small image patches.
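The indirect calculation behind stereo vision follows from epipolar geometry: for a rectified pair, depth is inversely proportional to disparity, Z = fB/d. A minimal sketch of this relation, with focal length and baseline values that are purely illustrative (loosely modeled on a KITTI-like rig, not taken from the paper):

```python
# Depth from disparity for a rectified stereo pair (standard epipolar
# geometry). The camera parameters below are illustrative assumptions.
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Z = f * B / d for a rectified stereo rig.

    focal_px     -- focal length in pixels
    baseline_m   -- distance between the two camera centers in meters
    disparity_px -- horizontal offset of a point between left/right views
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_m / disparity_px

# A point with larger disparity is closer to the camera:
near = depth_from_disparity(focal_px=721.5, baseline_m=0.54, disparity_px=100.0)
far = depth_from_disparity(focal_px=721.5, baseline_m=0.54, disparity_px=10.0)
```

This inverse relationship also explains why far-away points, whose disparities approach zero, are the hardest to estimate accurately.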
Recently, CNNs have achieved remarkable success in monocular depth estimation, showing that convolutional features are superior to manually extracted ones [18], [19], [20], [21], [22], [23], [24], [25], [26], [27], [28], [29]. By learning prior knowledge about objects and their depth, these networks yield better results in constrained indoor or outdoor evaluation scenarios. In earlier research, CNNs were often combined with conditional random field (CRF) [24], [25], [26], [27], Fourier frequency domain (FFD) [28], and random forest [29] approaches to learn the latent correlations between pixel color and depth. However, these methods have high complexity, and their predictions have lower resolution than the input images. To restore the degraded depth map resolution, some methods adopt an encoder–decoder structure to convert from the RGB image domain to depth maps [19], [20], [23], [24]: the encoder extracts multiscale spatial features, and the decoder uses a multistage upsampling structure, trained end to end, to recover detailed object information in the depth maps. A major weakness of these methods, however, is their reliance on large-scale datasets with ground truth depth maps as supervision signals. Collecting such datasets across varied real-world scenarios is time-consuming and expensive, requiring, for example, costly lasers and depth cameras.
In summary, depth sensors are expensive, and depth information may be lost in some practical situations (e.g., when an object's surface is too smooth). Traditional machine learning methods are strongly affected by the quality of the extracted features. A major weakness of training CNNs with supervised learning is the difficulty of obtaining large-scale, high-quality datasets with ground truth depth maps. Although some researchers have proposed unsupervised methods, their prediction accuracy still needs to improve to narrow the gap with supervised approaches. To address these issues, we design a novel CNN with an encoder–decoder structure to estimate the depth map and introduce a new dynamic optimization strategy to improve training speed and prediction accuracy. First, following the principles of binocular disparity and epipolar geometry, we treat depth prediction as a disparity map regression problem: given rectified stereo image pairs, our CNN model outputs the depth map, with no other ground truth depth data needed. Second, we build our encoder–decoder architecture on Squeeze-and-Excitation (SE) blocks [30] and add shortcut connections between corresponding layers in the encoder and decoder to estimate highly detailed depth information. Finally, we present a new data-dependent upsampling method that lets the network automatically learn the importance of each feature channel, and we employ a novel optimizer [31] to improve convergence speed. When evaluated on the KITTI Split [10], our method achieves the best results on 8 evaluation metrics; on the KITTI Eigen Split [18], it achieves the best results on 5 evaluation metrics and the second best on 2.
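The channel reweighting performed by an SE block [30] can be sketched in a few lines: squeeze each channel to a scalar by global average pooling, pass it through a small bottleneck MLP, and rescale the channels by the resulting sigmoid weights. The NumPy version below is a minimal illustration with assumed layer sizes (8 channels, reduction ratio 4); it does not reproduce the paper's actual architecture:

```python
import numpy as np

def se_block(feature_map, w1, b1, w2, b2):
    """Squeeze-and-Excitation channel reweighting [30], minimal sketch.

    feature_map -- array of shape (C, H, W)
    w1, b1      -- squeeze FC layer (C -> C/r)
    w2, b2      -- excitation FC layer (C/r -> C)
    """
    # Squeeze: global average pooling collapses each channel to one scalar.
    z = feature_map.mean(axis=(1, 2))                  # shape (C,)
    # Excitation: a small bottleneck MLP learns per-channel importance.
    hidden = np.maximum(0.0, w1 @ z + b1)              # ReLU, shape (C/r,)
    scale = 1.0 / (1.0 + np.exp(-(w2 @ hidden + b2)))  # sigmoid, shape (C,)
    # Rescale: each channel is multiplied by its learned importance weight.
    return feature_map * scale[:, None, None]

# Illustrative shapes: C = 8 channels, reduction ratio r = 4.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16, 16))
w1, b1 = rng.standard_normal((2, 8)), np.zeros(2)
w2, b2 = rng.standard_normal((8, 2)), np.zeros(8)
y = se_block(x, w1, b1, w2, b2)
```

Because the sigmoid weights lie in (0, 1), the block never amplifies a channel; it only attenuates less informative ones relative to the rest.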
Unlike previous related work, we also evaluate depth estimation performance on the Cityscapes dataset [32], further demonstrating the effectiveness of our approach.
The rest of the paper is organized as follows. Section 2 reviews recent advances in depth estimation from monocular images. Section 3 describes our unsupervised learning method for monocular depth prediction in detail, including three elementary theoretical statements, the network architecture, the data augmentation method, the optimizer, the activation function, and the loss function construction. Section 4 demonstrates the validity and reliability of our method through experimentation and analysis based on the established model. Finally, Section 5 presents conclusions and future work.
Section snippets
Related work
Early related work mainly depended on hand-crafted features and probabilistic graphical models. Saxena et al. [17] constructed a depth prediction model using supervised learning, applying a multiscale MRF to aggregate multiscale local and global image features and model the depth relationships between pixels in the image. Karsch et al. [33] introduced a technique for automatically generating a plausible depth map from video by nonparametric depth-sampling
The proposed approach
This section gives a detailed description of our monocular image depth prediction method. First, based on the basic theory of the pinhole camera model, binocular stereo vision, and epipolar geometry, we describe our framework for disparity map regression. Second, we discuss each component of our method, including the data augmentation techniques, the choice of optimizer, the activation function, and the loss function. Finally, we present in detail how the network parameters are learned in our unsupervised CNN framework.
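In stereo-based unsupervised frameworks of this kind, the training signal typically comes from view reconstruction: warp one image of the rectified pair by the predicted disparity and penalize the photometric difference against the other image. The toy 1-D sketch below illustrates that idea only; the paper's actual loss construction is defined in Section 3:

```python
# Toy 1-D view-reconstruction loss, the core idea behind unsupervised
# stereo training (illustrative sketch, not the paper's exact loss).
def reconstruct_left(right_row, disparities):
    """Sample the right image row at x - d(x) to synthesize the left row.

    right_row   -- list of pixel intensities from the right image
    disparities -- predicted integer disparity per pixel (toy 1-D case)
    """
    w = len(right_row)
    # Clamp sample positions to the row bounds at the image border.
    return [right_row[max(0, min(w - 1, x - d))]
            for x, d in enumerate(disparities)]

def photometric_l1(left_row, reconstructed):
    """Mean absolute photometric error between real and synthesized rows."""
    return sum(abs(a - b) for a, b in zip(left_row, reconstructed)) / len(left_row)

right = [0, 1, 2, 3, 4, 5, 6, 7]
left = [right[max(0, x - 2)] for x in range(8)]  # true disparity is 2

perfect = photometric_l1(left, reconstruct_left(right, [2] * 8))  # 0.0
wrong = photometric_l1(left, reconstruct_left(right, [0] * 8))    # > 0
```

A correct disparity field reconstructs the left view exactly and drives the loss to zero, which is what lets the network learn depth without any ground truth depth maps.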
Experimentation
In this section, we perform extensive experiments on public benchmarks, the KITTI and Cityscapes datasets, to verify the validity of our model. During training, our method does not require any depth data. Table 4 reports the sizes of the input, feature maps, and output in our network.
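The KITTI evaluations referred to above conventionally use the error and accuracy metrics popularized by Eigen et al. [18]. A minimal sketch of three of them, assuming those standard definitions rather than the paper's exact evaluation protocol:

```python
import math

def depth_metrics(gt, pred):
    """Standard monocular-depth metrics (Eigen et al. [18] style sketch).

    gt, pred -- lists of positive ground-truth and predicted depths (meters).
    """
    n = len(gt)
    # Mean absolute relative error: |g - p| / g, averaged over pixels.
    abs_rel = sum(abs(g - p) / g for g, p in zip(gt, pred)) / n
    # Root mean squared error in meters.
    rmse = math.sqrt(sum((g - p) ** 2 for g, p in zip(gt, pred)) / n)
    # Accuracy: fraction of pixels with max(g/p, p/g) below threshold 1.25.
    delta1 = sum(max(g / p, p / g) < 1.25 for g, p in zip(gt, pred)) / n
    return {"abs_rel": abs_rel, "rmse": rmse, "delta<1.25": delta1}
```

A perfect prediction yields zero error and delta accuracy 1.0; halving every predicted depth would push abs_rel to 0.5 and delta accuracy to 0.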
Conclusions
We present a new network structure for predicting dense depth maps from rectified stereo pairs; it can be trained in an unsupervised, end-to-end manner without ground truth depth maps or any pretrained models. To improve depth prediction accuracy and network convergence speed, we propose an encoder–decoder with an SE block structure and a novel upsampling strategy. Furthermore, our method shows promising performance compared to other
CRediT authorship contribution statement
Songnan Chen: Writing - original draft, Methodology, Visualization, Investigation, Validation, Formal analysis. Mengxia Tang: Conceptualization, Supervision, Methodology, Writing - review & editing, Funding acquisition. Jiangming Kan: Supervision, Methodology, Writing - review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This research was funded by the National Natural Science Foundation of China (Grant numbers 31660239 and 32071680); the Beijing municipal construction project special fund; the Science and Technology Department of Henan Province (Grant number 182102110160); and the Young Teachers Fund of Xinyang Agriculture and Forestry University (Grant number 201701013).
References (65)
- et al., Make3D: Learning 3D scene structure from a single still image, IEEE Trans. Pattern Anal. Mach. Intell. (2009)
- et al., Efficient multi-view 3D video multicast with depth image based rendering in LTE-advanced networks with carrier aggregation, IEEE Trans. Mob. Comput. (2018)
- S. Zia, B. Yuksel, D. Yuret, Y. Yemez, RGB-D object recognition using deep convolutional neural networks, in: IEEE...
- X.F. Ren, L.F. Bo, D. Fox, RGB-(D) scene labeling: Features and algorithms, in: IEEE Conference on Computer Vision and...
- S. Lee, S.J. Park, K.S. Hong, RDFNet: RGB-D multi-level residual feature fusion for indoor semantic segmentation, in:...
- P.Y. Chen, A.H. Liu, Y.C. Liu, Towards scene understanding: unsupervised monocular depth estimation with semantic-aware...
- Y. Lu, J. Zhou, J. Wang, J. Chen, K. Smith, C. Wilder, S. Wang, Curve-structure segmentation from depth maps: A...
- J. Xie, R. Girshick, A. Farhadi, Deep3D: Fully automatic 2D-to-3D video conversion with deep convolutional neural...
- J. Levinson, J. Askeland, J. Becker, Towards fully autonomous driving: Systems and algorithms, in: Intelligent Vehicles...
- A. Geiger, P. Lenz, R. Urtasun, Are we ready for autonomous driving? The KITTI vision benchmark suite, in: Computer...
- A kind of infrared expand depth of field vision sensor in low-visibility road condition for safety-driving, Sens. Rev.
- Learning depth from single monocular images using deep convolutional neural fields, IEEE Trans. Pattern Anal. Mach. Intell.
- PatchMatch stereo: Stereo matching with slanted support windows
- Relative importance of binocular disparity and motion parallax for depth estimation: A computer vision approach, Remote Sens.
- Predicting depth from single RGB images with pyramidal three-streamed networks, Sensors
- Encoder–decoder with densely convolutional networks for monocular depth estimation, J. Opt. Soc. Amer. A
- Monocular depth estimation using multi-scale continuous CRFs as sequential deep networks, IEEE Trans. Pattern Anal. Mach. Intell.
- On the variance of the adaptive learning rate and beyond