Unsupervised Scale-Consistent Depth Learning from Video

International Journal of Computer Vision

Abstract

We propose SC-Depth, a monocular depth estimation method that requires only unlabelled videos for training and enables scale-consistent prediction at inference time. Our contributions include: (i) a geometry consistency loss, which penalizes the inconsistency of predicted depths between adjacent views; (ii) a self-discovered mask that automatically localizes moving objects, which violate the underlying static scene assumption and cause noisy signals during training; (iii) a detailed ablation study that demonstrates the efficacy of each component, together with high-quality depth estimation results on both the KITTI and NYUv2 datasets. Moreover, thanks to the capability of scale-consistent prediction, we show that our monocular-trained deep networks are readily integrated into the ORB-SLAM2 system for more robust and accurate tracking. The proposed hybrid Pseudo-RGBD SLAM shows compelling results on KITTI, and it generalizes well to the KAIST dataset without additional training. Finally, we provide several demos for qualitative evaluation. The source code is released on GitHub.
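
As an illustration of contributions (i) and (ii), the following is a minimal sketch in plain Python. It assumes a normalized depth difference formulation, in which the per-pixel inconsistency is |D_a - D_b| / (D_a + D_b), the loss is its mean over all pixels, and the mask weight is one minus the inconsistency. The function names are hypothetical, the inputs are assumed to be positive depth maps that have already been warped and interpolated into a common view, and the warping step itself is omitted. This is a sketch under those assumptions, not the released implementation.

```python
from typing import List

Depth = List[List[float]]  # row-major depth map, all values assumed > 0


def depth_inconsistency(warped_a: Depth, interp_b: Depth) -> Depth:
    """Per-pixel normalized depth difference, in [0, 1) for positive depths.

    warped_a: depth of view a, assumed already projected into view b.
    interp_b: depth predicted for view b, sampled at the same pixels.
    """
    return [
        [abs(da - db) / (da + db) for da, db in zip(row_a, row_b)]
        for row_a, row_b in zip(warped_a, interp_b)
    ]


def geometry_consistency_loss(diff: Depth) -> float:
    """Mean inconsistency over all pixels (all pixels assumed valid here)."""
    values = [d for row in diff for d in row]
    return sum(values) / len(values)


def self_discovered_mask(diff: Depth) -> Depth:
    """Weight of 1 for consistent pixels, lower where the depths disagree."""
    return [[1.0 - d for d in row] for row in diff]


if __name__ == "__main__":
    warped_a = [[2.0, 4.0], [1.0, 3.0]]
    interp_b = [[2.0, 4.0], [3.0, 3.0]]
    diff = depth_inconsistency(warped_a, interp_b)
    print(geometry_consistency_loss(diff))  # 0.125
    print(self_discovered_mask(diff))       # [[1.0, 1.0], [0.5, 1.0]]
```

In this toy input only the bottom-left pixel disagrees (depths 1 and 3), so it receives an inconsistency of 0.5 and a mask weight of 0.5, while the consistent pixels keep a weight of 1. Because the difference is divided by the sum of the two depths, the measure does not depend on the absolute scale of the predictions.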


Acknowledgements

This work was supported in part by the Australian Centre of Excellence for Robotic Vision (CE140100016) and the ARC Laureate Fellowship (FL130100102) to Prof. Ian Reid. It was also supported by the Major Project for New Generation of AI (No. 2018AAA0100403), the Tianjin Natural Science Foundation (No. 18JCYBJC41300 and No. 18ZXZNGX00110), and NSFC (61922046) to Prof. Ming-Ming Cheng. We also thank the anonymous reviewers for their valuable suggestions.

Author information

Corresponding author

Correspondence to Jia-Wang Bian.

Additional information

Communicated by Ming-Hsuan Yang.

About this article

Cite this article

Bian, JW., Zhan, H., Wang, N. et al. Unsupervised Scale-Consistent Depth Learning from Video. Int J Comput Vis 129, 2548–2564 (2021). https://doi.org/10.1007/s11263-021-01484-6

  • DOI: https://doi.org/10.1007/s11263-021-01484-6
