Depth prediction from 2D images: A taxonomy and an evaluation study

doi:10.1016/j.imavis.2019.11.003

Image and Vision Computing

Volume 93, January 2020, 103825

https://doi.org/10.1016/j.imavis.2019.11.003

Abstract

Among the various cues that help us understand and interact with our surroundings, depth is of particular importance. It allows us to move in space and grab objects to complete different tasks. Therefore, depth prediction has been an active research field for decades and many algorithms have been proposed to retrieve depth. Some imitate human vision and compute depth through triangulation on correspondences found between pixels or handcrafted features in different views of the same scene. Others rely on simple assumptions and semantic knowledge of the structure of the scene to get the depth information. Recently, numerous algorithms based on deep learning have emerged from the computer vision community. They implement the same principles as the non-deep learning methods and leverage the ability of deep neural networks to automatically learn the features that help solve the task. By doing so, they produce new state-of-the-art results and show encouraging prospects. In this article, we propose a taxonomy of deep learning methods for depth prediction from 2D images, using the training strategy as the sorting criterion. Some methods are trained in a supervised manner, which means depth labels are needed during training, while others are trained in an unsupervised manner. In the latter case, the models learn to perform a different task, such as view synthesis, and depth is only a by-product of this learning. In addition to this taxonomy, we evaluate nine models on two similar datasets without retraining. Our analysis showed that (i) most models are sensitive to sharp discontinuities created by shadows or colour contrasts and (ii) the post-processing applied to the results before computing the commonly used metrics can change the model ranking. Moreover, we showed that most metrics agree with each other and are thus redundant.

Introduction

Human vision provides us with various cues that help us understand our surroundings and interact with them. Among these cues, those related to depth perception are of particular importance in several tasks, such as following a path, climbing stairs, and avoiding or grasping objects [1, 2, 3, 4]. Therefore, depth estimation from images has been an active research problem in computer vision for decades, and diverse solutions have been proposed to give the same abilities to machines.

Depth estimation algorithms can be divided into monocular and multi-view methods, depending on the number of images required to infer depth. Estimating depth from a single image is an ill-posed problem, since many three-dimensional structures can have the same two-dimensional projection. It is nonetheless worth solving, because recovering the real structure is useful in various situations, such as automatically converting a 2D film into 3D [5, 6, 7]. To that end, a few algorithms inspired by the way humans use monocular cues as well as their prior visual experience have been proposed [8, 9, 10, 11, 12]. On the other hand, the multi-view approach, in one of its simplest forms, imitates human binocular vision, replacing the eyes with a stereo camera and obtaining depth through triangulation on two-dimensional correspondences found in two images of a scene taken from different angles [13, 14]. Other sources of data can also be used, such as a sequence recorded by a hand-held camera in structure-from-motion methods [15, 16, 17, 18, 19].
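The triangulation step mentioned above can be illustrated with a minimal sketch. It assumes a rectified stereo pair, for which the depth of a point follows the standard relation Z = f · B / d, with f the focal length in pixels, B the baseline between the two cameras and d the disparity between corresponding pixels. The function name and the camera parameters below are hypothetical and chosen only for illustration; they are not taken from the methods surveyed in this article.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth in metres of a point seen by a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        # Zero disparity corresponds to a point at infinity; a negative
        # value indicates a wrong correspondence. Both are mapped to inf here.
        return float("inf")
    return focal_px * baseline_m / disparity_px


# Hypothetical rig (assumption): 700 px focal length, 0.5 m baseline.
disparities = (70.0, 35.0, 7.0)
depths = [depth_from_disparity(d, focal_px=700.0, baseline_m=0.5) for d in disparities]
print(depths)  # larger disparities correspond to closer points
```

Because depth is inversely proportional to disparity, a one-pixel matching error has little effect on nearby points but a large effect on distant ones, which is one reason correspondence quality matters so much in multi-view methods.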

Recently, new methods have been proposed that take advantage of the emergence of deep learning and convolutional neural networks (CNNs) to learn parts of existing pipelines or even the entire depth estimation process [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34]. Indeed, deep convolutional neural networks are particularly good at understanding global semantics, knowledge that can lead to better results than those given by local geometry [30]. Eigen et al. were the first to use CNNs in their work [20, 21] and many other models have been developed since then. In this paper, we present a taxonomy of depth estimation methods based on deep learning as well as qualitative and quantitative comparisons of the results they yield on two different datasets. The remainder of this paper is organised as follows. Since the deep learning approach shares similarities with non-deep methods, Section 2.1 gives an overview of the latter. Section 2.2 describes the existing CNN-based methods along with their respective training strategies. In Section 3.1, the datasets on which the algorithms have been tested in our experiments are presented. Section 4 gives a qualitative and a quantitative analysis of the results. Finally, Section 5 concludes our work and gives some prospects for improving depth estimation with deep learning methods.

Depth prediction from 2D images: a taxonomy

Here, we divide the depth prediction algorithms into two groups: non-deep learning methods and deep learning methods. The latter are further separated into subgroups, depending on the training strategy.

Datasets

Datasets are of particular importance in machine learning. Indeed, they provide the examples from which models learn the tasks for which they are developed. Creating a dataset or choosing an existing one to train a model is not a straightforward process. Particular attention must be paid to several points, such as selecting a dataset with enough variability to avoid overfitting and choosing one that meets the requirements of the model design. Indeed, training a network that needs depth

Results

This section presents the results of the experiments described in Section 3. It first details our qualitative analysis and then gives the results of our quantitative analysis on both datasets with the four transforms applied to the data.
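To make the quantitative side concrete, the sketch below computes three error measures that are widely reported in the depth estimation literature (absolute relative error, RMSE and the δ < 1.25 accuracy), before and after a median-scaling step. This snippet does not name the metrics or the transforms used in the study, so both the choice of metrics and the use of median scaling as an example of post-processing are assumptions made for illustration. The function names and the toy values are hypothetical.

```python
import math
import statistics


def valid_pairs(pred, gt):
    """Keep only pixels with a positive ground truth and a positive prediction."""
    return [(p, g) for p, g in zip(pred, gt) if g > 0 and p > 0]


def depth_metrics(pred, gt):
    """Absolute relative error, RMSE and delta < 1.25 accuracy over valid pixels."""
    pairs = valid_pairs(pred, gt)
    n = len(pairs)
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    delta1 = sum(max(p / g, g / p) < 1.25 for p, g in pairs) / n
    return {"abs_rel": abs_rel, "rmse": rmse, "delta1": delta1}


def median_scale(pred, gt):
    """Rescale predictions so that their median matches the ground-truth median."""
    pairs = valid_pairs(pred, gt)
    scale = statistics.median(g for _, g in pairs) / statistics.median(p for p, _ in pairs)
    return [p * scale for p in pred]


# Toy example (made-up values): the prediction is correct up to a factor of 2,
# and the third pixel has no ground truth (encoded as 0).
gt = [2.0, 4.0, 0.0, 10.0]
pred = [1.0, 2.0, 3.0, 5.0]

raw = depth_metrics(pred, gt)
scaled = depth_metrics(median_scale(pred, gt), gt)
print(raw)
print(scaled)
```

In this toy case the raw prediction scores poorly on every metric, while the median-scaled prediction matches the ground truth exactly. A model that predicts depth only up to scale can therefore move up or down a ranking depending on whether such a step is applied before the metrics are computed.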

Conclusion

In this article, we proposed a taxonomy of deep learning models designed to solve the problem of depth estimation from 2D images. Such models are useful within the framework of scene understanding, as depth is one of the most important cues in vision. Our taxonomy uses the type of training to divide models into groups. We retained three main groups: supervised monocular methods, supervised multi-view methods and unsupervised monocular methods. Supervised monocular methods need a depth ground truth

Acknowledgements

Ambroise Moreau is funded through a PhD grant from the University of Mons, Belgium.

This research is partially funded by the European Regional Development Fund (ERDF) under the grant number ETR 1212 0000 3303.


^☆: This paper has been recommended for acceptance by Sinisa Todorovic.
