Depth prediction from 2D images: A taxonomy and an evaluation study☆
Introduction
Human vision provides various cues that help us understand our surroundings and interact with them. Among these cues, those related to depth perception are particularly important for tasks such as following a path, climbing stairs, and avoiding or grasping objects [1–4]. Depth estimation from images has therefore been an active research problem in computer vision for decades, and diverse solutions have been proposed to give machines the same abilities.
Depth estimation algorithms can be divided into monocular and multi-view methods, depending on the number of images required to infer depth. Estimating depth from a single image is an ill-posed problem, since many three-dimensional structures can have the same two-dimensional projection. Recovering the real structure is nonetheless useful in various situations, such as automatically converting a 2D film into 3D [5–7]. To that end, a few algorithms inspired by the way humans use monocular cues and prior visual experience have been proposed [8–12]. The multi-view approach, in one of its simplest forms, imitates human binocular vision: it replaces the eyes with a stereo camera and obtains depth through triangulation on two-dimensional correspondences found in two images of a scene taken from different viewpoints [13, 14]. Other sources of data can also be used, such as a sequence recorded by a hand-held camera in structure-from-motion methods [15–19].
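The two ideas in this paragraph can be illustrated with a minimal sketch, which is not taken from the paper. It assumes an ideal pinhole camera and a rectified stereo pair, for which depth follows Z = f·B/d (focal length f in pixels, baseline B in metres, disparity d in pixels). The function names and all numeric values are hypothetical and chosen only for illustration.

```python
def project(point, focal=1.0):
    """Pinhole projection of a 3D point (X, Y, Z) onto the image plane."""
    x, y, z = point
    return (focal * x / z, focal * y / z)


def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth of a matched point in a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


# Monocular ambiguity: two different 3D points on the same viewing ray
# project to the same 2D location, so one image alone cannot tell them apart.
near = (1.0, 0.5, 2.0)
far = (2.0, 1.0, 4.0)  # 'near' scaled by 2 along the viewing ray
same_projection = project(near) == project(far)

# Stereo triangulation: a correspondence with a known disparity fixes the depth.
# Hypothetical rig: 700 px focal length, 0.5 m baseline, 35 px disparity.
z = depth_from_disparity(disparity_px=35.0, focal_px=700.0, baseline_m=0.5)

print(same_projection, z)
```

The sketch leaves out the hard part of stereo methods, which is finding the correspondences themselves; it only shows why a second view removes the ambiguity that a single projection leaves open.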
Recently, new methods have taken advantage of the emergence of deep learning and convolutional neural networks (CNNs) to learn parts of existing pipelines or even the entire depth estimation process [20–34]. Indeed, deep convolutional neural networks are particularly good at capturing global semantics, knowledge that can lead to better results than those obtained from local geometry alone [30]. Eigen et al. were the first to use CNNs for this task [20, 21], and many other models have been developed since then. In this paper, we present a taxonomy of depth estimation methods based on deep learning, as well as qualitative and quantitative comparisons of the results they yield on two different datasets. The remainder of this paper is organised as follows. Since the deep learning approach shares similarities with non-deep methods, Section 2.1 gives an overview of the latter. Section 2.2 describes the existing CNN-based methods along with their respective training strategies. Section 3.1 presents the datasets on which the algorithms were tested in our experiments. Section 4 gives a qualitative and a quantitative analysis of the results. Finally, Section 5 concludes our work and outlines prospects for improving depth estimation with deep learning methods.
Depth prediction from 2D images: a taxonomy
Here, we divide depth prediction algorithms into two groups: non-deep learning methods and deep learning methods. The latter are further separated into subgroups, depending on the training strategy.
Datasets
Datasets are of particular importance in machine learning, as they provide the examples from which models learn the tasks for which they are developed. Creating a dataset or choosing an existing one to train a model is not a straightforward process. Particular attention must be paid to several points, such as selecting a dataset with enough variability to avoid overfitting and choosing one that meets the requirements of the model design. Indeed, training a network that needs depth
Results
This section presents the results of the experiments described in Section 3. It first details our qualitative analysis and then gives the results of our quantitative analysis on both datasets with the four transforms applied to the data.
Conclusion
In this article, we proposed a taxonomy of deep learning models designed to solve the problem of depth estimation from 2D images. Such models are useful for scene understanding, as depth is one of the most important cues in vision. Our taxonomy uses the type of training to divide models into groups. We retained three main groups: supervised monocular methods, supervised multi-view methods and unsupervised monocular methods. Supervised monocular methods need a depth ground truth
Acknowledgements
Ambroise Moreau is funded through a PhD grant from the University of Mons, Belgium.
This research is partially funded by the European Regional Development Fund (ERDF) under the grant number ETR 1212 0000 3303.
☆ This paper has been recommended for acceptance by Sinisa Todorovic.