Monocular image depth prediction without depth sensors: An unsupervised learning method

https://doi.org/10.1016/j.asoc.2020.106804

Highlights

  • This method realizes unsupervised learning of monocular depth estimation without depth sensors.

  • The depth prediction problem is considered as the regression problem of disparity maps.

  • A new encoder–decoder model with Squeeze-and-Excitation blocks recovers multiscale depth information.

  • A novel variant of Adam is introduced to improve the convergence speed and prediction accuracy.

Abstract

Monocular image depth prediction is a challenging problem in three-dimensional (3D) perception, the purpose of which is to recover the geometric features of 3D scenes from two-dimensional (2D) images. At present, deep learning methods for monocular depth prediction have yielded good results, but most treat the task as a supervised deep regression problem. A significant weakness of such methods is the need to collect large amounts of depth measurement data in actual scenarios for training. In this paper, we design a novel convolutional neural network (CNN) with an encoder–decoder structure to estimate the depth map from monocular RGB images based on the basic principles of binocular stereo vision, and we use rectified stereo pairs to train our network from scratch in an unsupervised manner without any depth data. We also explore a new upsampling strategy to improve the output resolution, and introduce a new dynamic optimization strategy to enhance the training speed and prediction accuracy. Extensive experiments on the publicly available KITTI and Cityscapes datasets demonstrate that our approach is more accurate than competing methods. The findings also illustrate that our CNN model can be utilized for depth completion of LIDAR data.

Introduction

Depth prediction is an active area of research in the domain of computer and robot vision. In various application scenarios, such as 3D reconstruction [1], [2], scene recognition [3], [4], semantic segmentation [5], [6], [7], augmented reality [8], assisted driving and automatic guidance [9], [10], [11], depth information is a crucial clue for understanding the geometric relationships in a scene. In this paper, we focus specifically on the depth estimation problem without depth sensors or ground truth depth data under monocular imaging conditions.

Depth information is often lost in the process of mapping a scene from 3D space to a 2D image plane, which means that a single image captured by a camera may correspond to many real-life scenes. Depth sensors, such as the Microsoft Kinect and LIDAR, are preferred by researchers to obtain depth information quickly and effectively. However, these sensors are cost-prohibitive, sensitive to strong sunlight, and tend to be burdensome in some practical applications [12]. Stereo vision approaches, attractive options over the past few years, rely on two or more rectified images to acquire a depth map by indirect calculation [13], [14], [15], [16]. Although stereo vision is cheap, lightweight, and more suitable for indoor and outdoor scenes than depth sensors, it has its own disadvantages: careful rectification is needed to ensure image quality, considerable time is required to obtain an accurate disparity map, and ambiguous predictions are generated on texture-less or confusingly textured surfaces. Feature extraction methods extract typical monocular clues involving texture changes, occlusions and object sizes, and typically formulate the depth estimation problem as a Markov random field (MRF) learning problem [1], [17]. However, most monocular clues are “context information”, that is, global attributes of an image that cannot be inferred from small image blocks.
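The stereo relationship mentioned above reduces, for a rectified pair, to the standard triangulation formula Z = fB/d, where f is the focal length in pixels, B the baseline between the two cameras, and d the disparity. The sketch below illustrates this relationship; the sample values (f ≈ 721 px, B ≈ 0.54 m, roughly KITTI-like) are illustrative assumptions, not figures from the paper.

```python
def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Depth Z = f * B / d for a rectified stereo pair (pinhole camera model)."""
    if disparity_px <= 0:
        # Zero disparity corresponds to a point at infinity.
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_m / disparity_px

# Illustrative, roughly KITTI-like values: f = 721 px, B = 0.54 m.
# A 10-pixel disparity then corresponds to about 38.9 m of depth.
z = depth_from_disparity(721.0, 0.54, 10.0)
```

This inverse relationship is why regressing a disparity map is equivalent to predicting depth once the camera calibration is known.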

Recently, CNNs have achieved remarkable success in monocular depth estimation, showing that convolutional features are superior to manually extracted ones [18], [19], [20], [21], [22], [23], [24], [25], [26], [27], [28], [29]. By learning prior knowledge of objects and their depth, these networks yield better results in indoor or outdoor restricted evaluation scenarios. In earlier research, CNNs were often combined with conditional random field (CRF) [24], [25], [26], [27], Fourier frequency domain (FFD) [28] and random forest [29] approaches to learn the potential correlations between pixel color and depth. However, these methods have high complexity, and the predicted results have poorer resolution than the input images. To restore the degraded depth map resolution, some methods adopt an encoder–decoder structure to realize the conversion from the RGB image domain to depth maps [19], [20], [23], [24]. The encoder extracts multiscale spatial features, and the decoder adopts a multistage upsampling structure to recover the detailed information of objects in the depth maps through end-to-end training. However, a major weakness of the above methods is that they rely on large-scale datasets with ground truth depth maps as supervision signals. Collecting these datasets in a variety of real-world scenarios is time-consuming and expensive, for example, requiring expensive lasers and depth cameras to collect depth data.

It is not difficult to see that depth sensors are expensive, and that depth information may be lost in some practical applications (for example, when the surface of an object is too smooth). Traditional machine learning methods are greatly affected by the quality of the extracted features. A major weakness of using supervised learning to train CNNs is the difficulty of obtaining large-scale, high-quality datasets with ground truth depth maps. While some researchers have proposed unsupervised methods, prediction accuracy must still be improved to narrow the gap with supervised learning methods. To address these issues, we design a novel CNN with an encoder–decoder structure to estimate the depth map, and introduce a new dynamic optimization strategy to enhance training speed and prediction accuracy. First, we treat the depth prediction problem as a regression problem over disparity maps according to binocular disparity and epipolar geometry principles, and we use a novel CNN model to output the depth map given rectified stereo image pairs; no other ground truth depth data are needed. Second, we design our encoder–decoder architecture with Squeeze-and-Excitation (SE) blocks [30] and add short-cut connections between corresponding layers of the encoder and decoder to estimate highly detailed depth information. Finally, we present a new data-dependent upsampling method that enables the network to automatically learn the importance of each feature channel, and we employ a novel optimizer [31] to improve convergence speed. When evaluated on the KITTI Split [10], our method achieved the best results on 8 evaluation metrics; on the KITTI Eigen Split [18], it achieved the best results on 5 evaluation metrics and the second-best results on 2.
Unlike previous related work, we also evaluate depth estimation performance on the Cityscapes [32] dataset, demonstrating the effectiveness of our approach.
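The SE blocks [30] used in our encoder–decoder reweight feature channels by squeezing spatial information into a per-channel descriptor and learning per-channel gates. The following numpy sketch shows the mechanism on a single (C, H, W) feature map; the function name, the explicit weight arguments and the reduction ratio are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def se_block(x, w1, b1, w2, b2):
    """Squeeze-and-Excitation channel reweighting (after Hu et al. [30]).

    x:  feature map of shape (C, H, W)
    w1: bottleneck weights of shape (C // r, C), r = reduction ratio
    w2: expansion weights of shape (C, C // r)
    """
    # Squeeze: global average pooling over the spatial dimensions -> (C,)
    z = x.mean(axis=(1, 2))
    # Excitation: FC -> ReLU -> FC -> sigmoid yields per-channel gates in (0, 1)
    s = np.maximum(w1 @ z + b1, 0.0)
    g = 1.0 / (1.0 + np.exp(-(w2 @ s + b2)))
    # Scale: reweight each input channel by its learned gate
    return x * g[:, None, None]
```

With all-zero weights the gates are sigmoid(0) = 0.5, so every channel is simply halved; during training the weights learn to emphasize informative channels and suppress uninformative ones.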

The rest of this paper is organized as follows. In Section 2, we review recent advances in depth estimation from monocular images. In Section 3, we describe the unsupervised learning method for monocular image depth prediction in detail, including three elementary theoretical statements, the network architecture, the data augmentation method, the optimizer, the activation function and the loss function construction. In Section 4, based on the established model, we demonstrate the validity and reliability of our method through experimentation and analysis. Finally, conclusions and future work are presented in Section 5.

Section snippets

Related work

Early related work mainly depended on artificially extracted features and probabilistic graphical models. Saxena et al. [17] constructed a depth prediction model using a supervised learning method, which applied a multiscale MRF to aggregate multiscale local and global image features and model the depth relationships between different pixels in the image. Karsch et al. [33] introduced a technique for automatically generating a trusted depth map from video by nonparametric depth-sampling

The proposed approach

We give a detailed description of monocular image depth prediction in this section. First, based on the basic theory of the pinhole model, binocular stereo vision and epipolar geometry, we describe our framework for disparity map regression. Second, we discuss each component of our method, including the data augmentation techniques, the optimizer, the activation function and the loss function. Finally, we present in detail how the network parameters are learned in our unsupervised CNN framework.
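The core of such an unsupervised framework is that the predicted disparity lets one view of a rectified pair be reconstructed from the other, and the photometric reconstruction error then serves as the training signal instead of ground truth depth. The sketch below shows this idea with a nearest-neighbor horizontal warp and an L1 penalty; it is a minimal illustration of the general principle (function names and the simple L1 loss are assumptions, not the paper's exact loss, which is defined in Section 3).

```python
import numpy as np

def reconstruct_left(right, disp):
    """Reconstruct the left view from the right image of a rectified pair.

    For rectified stereo, the left pixel at column x corresponds to the
    right pixel at column x - d(x). Nearest-neighbor sampling, single channel.
    right, disp: (H, W) arrays; disp holds non-negative disparities in pixels.
    """
    H, W = right.shape
    xs = np.arange(W)[None, :] - np.rint(disp).astype(int)
    xs = np.clip(xs, 0, W - 1)  # clamp out-of-view samples at the border
    return np.take_along_axis(right, xs, axis=1)

def photometric_l1(left, recon):
    """Mean absolute photometric error between the real and warped views."""
    return np.abs(left - recon).mean()
```

Because both the warp and the loss are differentiable (with bilinear rather than nearest-neighbor sampling in practice), the disparity network can be trained end to end from image pairs alone.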

Experimentation

In this section, we performed extensive experiments on public benchmark data, including the KITTI and Cityscapes datasets, to verify the validity of our model. During training, our method does not require any depth data. Table 4 reports the sizes of the input, feature maps and output in our network.

Conclusions

We present a new network structure for predicting dense depth maps from rectified stereo pairs; it can be trained in an unsupervised, end-to-end manner without ground truth depth maps or any pretrained models. To improve the accuracy of depth prediction and the convergence speed of the network, we first propose an encoder–decoder with an SE block structure and a novel upsampling strategy to increase the estimation accuracy. Furthermore, our method shows promising performance compared to other

CRediT authorship contribution statement

Songnan Chen: Writing - original draft, Methodology, Visualization, Investigation, Validation, Formal analysis. Mengxia Tang: Conceptualization, Supervision, Methodology, Writing - review & editing, Funding acquisition. Jiangming Kan: Supervision, Methodology, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This research was funded by the National Natural Science Foundation of China (Grant numbers 31660239 and 32071680); the Beijing municipal construction project special fund; the Science and Technology Department of Henan Province (Grant number 182102110160); and the Young Teachers Fund of Xinyang Agriculture and Forestry University (Grant number 201701013).

References (65)

  • A. Saxena, et al., Make3D: Learning 3D scene structure from a single still image, IEEE Trans. Pattern Anal. Mach. Intell. (2009)
  • J.T. Lee, et al., Efficient multi-view 3D video multicast with depth image based rendering in LTE-advanced networks with carrier aggregation, IEEE Trans. Mob. Comput. (2018)
  • S. Zia, B. Yuksel, D. Yuret, Y. Yemez, RGB-D object recognition using deep convolutional neural networks, in: IEEE...
  • X.F. Ren, L.F. Bo, D. Fox, RGB-(D) scene labeling: Features and algorithms, in: IEEE Conference on Computer vision and...
  • S. Lee, S.J. Park, K.S. Hong, RDFNet: RGB-D Multi-level residual feature fusion for indoor semantic segmentation, in:...
  • P.Y. Chen, A.H. Liu, Y.C. Liu, Towards scene understanding: unsupervised monocular depth estimation with semantic-aware...
  • Y. Lu, J. Zhou, J. Wang, J. Chen, K. Smith, C. Wilder, S. Wang, Curve-structure segmentation from depth maps: A...
  • J. Xie, R. Girshick, A. Farhadi, Deep3D: Fully automatic 2D-to-3D video conversion with deep convolutional neural...
  • J. Levinson, J. Askeland, J. Becker, Towards fully autonomous driving: Systems and algorithms, in: Intelligent Vehicles...
  • A. Geiger, P. Lenz, R. Urtasun, Are we ready for autonomous driving? The KITTI vision benchmark suite, in: Computer...
  • H.F. Wang, et al., A kind of infrared expand depth of field vision sensor in low-visibility road condition for safety-driving, Sens. Rev. (2016)
  • F.Y. Liu, et al., Learning depth from single monocular images using deep convolutional neural fields, IEEE Trans. Pattern Anal. Mach. Intell. (2016)
  • P. Heise, S. Klose, B. Jensen, A. Knoll, PM-Huber: PatchMatch with huber regularization for stereo matching, in: IEEE...
  • M. Bleyer, et al., PatchMatch Stereo - Stereo Matching with Slanted Support Windows (2011)
  • M. Poggi, D. Pallotti, F. Tosi, S. Mattoccia, Guided stereo matching, in: IEEE Conference on Computer Vision and...
  • M. Mansour, et al., Relative importance of binocular disparity and motion parallax for depth estimation: A computer vision approach, Remote Sens. (2019)
  • A. Saxena, S.H. Chung, A.Y. Ng, Learning depth from single monocular images, in: International Conference on Neural...
  • D. Eigen, C. Puhrsch, R. Fergus, Depth map prediction from a single image using a multi-scale deep network, in:...
  • I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, N. Navab, Deeper depth prediction with fully convolutional residual...
  • S. Chen, et al., Predicting depth from single RGB images with pyramidal three-streamed networks, Sensors (2019)
  • H. Fu, M.M. Gong, C.H. Wang, K. Batmanghelich, D.H. Tao, Deep ordinal regression network for monocular depth...
  • D. Xu, W. Wang, H. Tang, H. Liu, N. Sebe, E. Ricci, Structured attention guided convolutional neural fields for...
  • S.N. Chen, et al., Encoder–decoder with densely convolutional networks for monocular depth estimation, J. Opt. Soc. Amer. A (2019)
  • B. Li, C.H. Shen, Y.C. Dai, A.V.D. Hengel, M.Y. He, Depth and surface normal estimation from monocular images using...
  • D. Xu, E. Ricci, W. Ouyang, X. Wang, N. Sebe, Multi-scale continuous CRFs as sequential deep networks for monocular...
  • F. Liu, et al., Learning depth from single monocular images using deep convolutional neural fields, IEEE Trans. Pattern Anal. Mach. Intell. (2016)
  • D. Xu, et al., Monocular depth estimation using multi-scale continuous CRFs as sequential deep networks, IEEE Trans. Pattern Anal. Mach. Intell. (2018)
  • J.H. Lee, M. Heo, K.R. Kim, C.S. Kim, Single-image depth estimation based on fourier domain analysis, in: IEEE...
  • A. Roy, S. Todorovic, Monocular depth estimation using neural regression forest, in: IEEE Conference on Computer Vision...
  • J. Hu, L. Shen, S. Albanie, G. Sun, E.H. Wu, Squeeze-and-excitation networks, in: IEEE Conference on Computer Vision...
  • L.Y. Liu, et al., On the variance of the adaptive learning rate and beyond (2019)
  • M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, B. Schiele, The cityscapes...