Unsupervised monocular visual odometry with decoupled camera pose estimation

https://doi.org/10.1016/j.dsp.2021.103052

Abstract

Drift, or error accumulation, is an inevitable challenge in visual odometry (VO). To alleviate this issue, most learning-based VO methods focus on various long- and short-term sequential learning schemes, while losing sight of the fact that inaccurate rotation estimation is the main source of VO drift. They usually estimate the six degrees of freedom (DoF) of the camera motion simultaneously, without considering the inherent rotation-translation ambiguity. In this paper, we first design a cascade decoupled structure and a residual-based decoupled pose refinement scheme for accurate pose estimation. We then extend them into an unsupervised monocular VO framework, which estimates the 3D camera pose by decoupling the estimation of rotation, translation and scale. Our VO model consists of three components: monocular depth estimation, decoupled pose estimation and decoupled pose refinement. The first component learns metric scale and depth cues from stereo pairs during training, and predicts the absolute depth of monocular inputs. The latter two separate the estimation and refinement of rotation and translation. To improve the robustness of the rotation estimation, we represent 3D rotation with unit quaternions instead of Euler angles. We have evaluated our model on the KITTI Visual Odometry Evaluation benchmark. Comparison experiments demonstrate that our method is superior to state-of-the-art unsupervised VO methods, and achieves results comparable to supervised ones.

Introduction

Visual odometry (VO) [1] recovers the trajectory of a moving platform from image sequences captured by one or more cameras mounted on it. The key to VO is estimating the 6-DoF camera pose relative to a fixed reference frame. As an essential task in computer vision, VO has applications ranging from autonomous driving and robotics to augmented reality. Over the past few decades, it has drawn great attention from researchers in both academia and industry.
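To make the trajectory-recovery idea concrete, the sketch below chains per-frame relative motions into global camera poses using 4x4 homogeneous transforms. The function names and the toy forward-motion data are illustrative, not part of the paper's implementation.

```python
import numpy as np

def make_transform(R, t):
    """Build a 4x4 homogeneous transform from rotation R (3x3) and translation t (3,)."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def accumulate_trajectory(relative_poses):
    """Chain per-frame relative transforms into global camera poses."""
    poses = [np.eye(4)]  # the first frame defines the fixed reference frame
    for T_rel in relative_poses:
        poses.append(poses[-1] @ T_rel)
    return poses

# Toy example: three identical forward steps of 1 m along the camera z-axis.
step = make_transform(np.eye(3), np.array([0.0, 0.0, 1.0]))
traj = accumulate_trajectory([step, step, step])
print(traj[-1][:3, 3])  # final position: [0. 0. 3.]
```

Because each new pose multiplies all previous estimates, any per-frame error propagates through the whole chain, which is exactly the drift-accumulation problem the paper addresses.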

Benefiting from the learning ability of convolutional neural networks (CNNs), many CNN-based VO methods have been proposed and have made significant progress in recent years [2], [3], [4]. They train CNN models in an end-to-end supervised or unsupervised fashion, taking several consecutive frames as input and outputting the relative camera motion between frames. Inaccurate motion estimation inevitably results in drift errors that accumulate over time. To alleviate the effect of VO drift, most existing VO networks focus on various long- and short-term sequential learning schemes. However, they lose sight of the fact that inaccurate rotation estimation is the main source of VO drift [5], [6]. They usually use a single fully convolutional network (FCN) to estimate the rotation and translation of camera motion simultaneously, without considering the inherent rotation-translation ambiguity.

In this paper, we first focus on designing a cascade decoupled structure for camera motion estimation and refinement, which formulates the rotation in unit quaternions and decouples the rotation and translation estimation. We then extend this structure into a general unsupervised monocular VO framework. Note that we adopt an unsupervised paradigm instead of a supervised one for two reasons:

  • 1)

    Due to the lack of guidance and constraint from ground-truth pose information, unsupervised methods are more susceptible to the drift problem, which makes them far less accurate than supervised VO methods.

  • 2)

    Large VO datasets with high-quality annotations are often difficult and expensive to obtain. Unsupervised methods are less affected by this dataset limitation, as they usually learn depth and camera motion jointly without any ground-truth information.

To the best of our knowledge, only a limited amount of learning-based research has been conducted on decoupled pose estimation [7], [8], [9]. Compared with the state-of-the-art unsupervised VO methods, the main contributions of our work are twofold:

  • 1)

    The design of a cascade decoupled structure for camera motion estimation. It effectively separates the rotation and translation estimation, which significantly improves the accuracy of rotation estimation and, in turn, further improves the accuracy of translation estimation. Additionally, we employ unit quaternions [10], [11] for the rotation representation to increase the robustness of rotation estimation.

  • 2)

    An unsupervised VO framework with decoupled camera pose estimation and refinement. It estimates the camera motion by decoupling the estimations of rotation, translation and scale, allowing these three transformations to be solved in cascade. Our framework consists of three components: a monocular depth estimation, a decoupled pose estimation and a residual-based decoupled pose refinement. We deliberately do not adopt any sequential learning schemes, so as to show the effectiveness of our cascade decoupled structure clearly.
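The unit-quaternion representation in contribution 1 can be made concrete with a small sketch: a raw 4-vector predicted by a network is re-normalized onto the unit sphere and converted to a rotation matrix with the standard quaternion-to-matrix formula. The function name is our own; only the representation itself follows the paper.

```python
import numpy as np

def quat_to_rotmat(q):
    """Convert a quaternion q = (w, x, y, z) to a 3x3 rotation matrix."""
    q = np.asarray(q, dtype=float)
    q = q / np.linalg.norm(q)  # re-normalizing keeps the output a valid rotation
    w, x, y, z = q
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

# 90-degree rotation about the z-axis: q = (cos 45deg, 0, 0, sin 45deg)
R = quat_to_rotmat([np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)])
print(np.round(R @ np.array([1.0, 0.0, 0.0]), 6))  # x-axis maps to y-axis: [0. 1. 0.]
```

The normalization step is one reason quaternions are attractive for learning: any raw 4-vector can be projected to a valid rotation, whereas Euler angles suffer from gimbal lock and discontinuities.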

The evaluation experiments on the public KITTI Visual Odometry Evaluation benchmark [12] demonstrate the advantages and effectiveness of our paradigm. Our model outperforms the state-of-the-art unsupervised VO methods, and achieves results comparable to supervised ones.


Related work

In the last decade, many innovations in ego-motion estimation and VO have been proposed. To mitigate the drift effect, many methods incorporate additional non-visual sensors, such as an inertial measurement unit (IMU) or a LiDAR sensor. Here we briefly review previous work on pure vision-based methods, which can be roughly divided into two categories: geometry-based and learning-based methods.

Network architecture

We design an unsupervised monocular VO model with decoupled pose estimation, which estimates the 3D camera motion by decoupling the estimations of rotation, translation and scale. It consists of three components: an unsupervised monocular depth estimation DispNet, a decoupled pose estimation, and a residual-based decoupled pose refinement. Fig. 1 depicts the schematic architecture of our network. To demonstrate the validity of our cascade decoupling idea, we do not adopt any long- and short-term sequential learning schemes.
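The decoupling idea can be illustrated by how the three separately estimated quantities recombine into a single camera motion: a rotation (e.g. from the quaternion branch), a translation direction, and a metric scale (recovered via the stereo-trained depth component). This is a minimal sketch of the composition step, not the paper's network code; the decomposition into direction and scale is an assumption for illustration.

```python
import numpy as np

def compose_motion(R, t_dir, scale):
    """Compose decoupled estimates (rotation, translation direction, scale)
    into one 4x4 homogeneous camera-motion transform."""
    t_dir = np.asarray(t_dir, dtype=float)
    t_dir = t_dir / np.linalg.norm(t_dir)  # keep only the direction
    T = np.eye(4)
    T[:3, :3] = R                 # rotation branch
    T[:3, 3] = scale * t_dir      # translation = metric scale x unit direction
    return T

# Pure forward motion: identity rotation, z-direction, 1.5 m of travel.
T = compose_motion(np.eye(3), [0.0, 0.0, 2.0], 1.5)
print(T[:3, 3])  # -> [0.  0.  1.5]
```

Solving the three factors in cascade means an error in one factor (say, scale) no longer contaminates the others during estimation, which is the motivation behind the decoupled design.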

Unsupervised loss functions

We adopt four unsupervised loss functions: the left-right photometric loss L_lr, the disparity smoothness loss L_smooth, the rigid warping loss L_rigid, and the pose cycle consistency loss L_loop. The total loss can be written as

L_total = λ_b L_lr + λ_s L_smooth + λ_r L_rigid + λ_l L_loop,

where λ_b, λ_s, λ_r and λ_l denote the weights that balance the contribution of each loss term.
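As a sketch, the weighted total is a plain linear combination, and one common formulation of a pose cycle consistency term penalizes the deviation of the forward-backward pose composition from the identity. The specific weight values and the Frobenius-norm form of the cycle term are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def pose_cycle_loss(T_fwd, T_bwd):
    """One common cycle-consistency formulation (assumed here): the forward
    and backward motions between two frames should compose to the identity."""
    return np.linalg.norm(T_fwd @ T_bwd - np.eye(4))

def total_loss(L_lr, L_smooth, L_rigid, L_loop,
               lam_b=1.0, lam_s=0.1, lam_r=1.0, lam_l=0.1):
    """Weighted sum of the four unsupervised loss terms (weights illustrative)."""
    return lam_b * L_lr + lam_s * L_smooth + lam_r * L_rigid + lam_l * L_loop

# A perfectly consistent forward/backward pair yields (numerically) zero cycle loss.
T = np.eye(4)
T[:3, 3] = [0.0, 0.0, 1.0]
print(pose_cycle_loss(T, np.linalg.inv(T)))  # ~ 0.0
```

The cycle term is the only loss above that constrains the pose network directly; the photometric, smoothness and rigid-warping terms supervise it indirectly through image reconstruction.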

Experimental evaluation

In this section, we first introduce the implementation details of our network. Then we evaluate our method on the public KITTI datasets, and analyze the impact of different network components on the performance of our method.

Network ablation studies

To study the impact of the proposed components on pose estimation performance, we carry out network ablation experiments with various configurations (as shown in Table 8) on the KITTI dataset.

The quantitative comparisons in Table 9 show the contribution of each component. The following conclusions can be drawn:

  • 1)

    The baseline, named “Ours(Baseline)”, is our framework without the pose cycle loss, decoupled pose estimation and refinement. Our framework with the proposed cascade decoupled structure outperforms this baseline.

Conclusions

In this paper, we introduce the rotation-translation decoupling idea into the pose estimation network to alleviate the VO drift problem. We first theoretically derive and design a cascade decoupled structure for camera motion estimation. It improves the accuracy of rotation estimation by separating the rotation and translation estimation in a cascade decoupled fashion. To improve the robustness of the rotation estimation, unit quaternions are adopted for the rotation representation instead of Euler angles.

CRediT authorship contribution statement

Lili Lin: Conceptualization, Validation, Writing – original draft. Weisheng Wang: Data curation, Software. Wan Luo: Data curation, Software. Lesheng Song: Data curation, Software. Wenhui Zhou: Formal analysis, Investigation, Methodology, Writing – original draft.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work is supported in part by the Zhejiang Provincial Natural Science Foundation of China (LY21F010007), the National Key Research and Development Program of China (2017YFE0118200), the Key Laboratory of Brain Machine Collaborative Intelligence of Zhejiang Province (2020E10010) and the Fundamental Research Funds for the Provincial Universities of Zhejiang (GK209907299001-008).

Lili Lin is currently an Associate Professor in the School of Information and Electronic Engineering at Zhejiang Gongshang University, China. She received the Ph.D degree in Information and Communication Engineering from Zhejiang University, China, in 2005. From 2012 to 2013, she was also a visiting scholar at Indiana University, USA. Her research interests include signal processing, image processing and computational photography.

References (55)

  • H. Bay et al., SURF: speeded up robust features, Comput. Vis. Image Underst. (2008)
  • J.-C. Bazin et al., Motion estimation by decoupling rotation and translation in catadioptric vision, Comput. Vis. Image Underst. (2010)
  • D. Scaramuzza et al., Visual odometry, IEEE Robot. Autom. Mag. (2011)
  • S. Wang et al., DeepVO: towards end-to-end visual odometry with deep recurrent convolutional neural networks
  • F. Xue et al., Beyond tracking: selecting memory and refining poses for deep visual odometry
  • Y. Li et al., Pose graph optimization for unsupervised monocular visual odometry
  • L. Carlone et al., Initialization techniques for 3D SLAM: a survey on rotation estimation and its use in pose graph optimization
  • P. Kim et al., Low-drift visual odometry in structured environments by decoupling rotational and translational motion
  • R. Li et al., Indoor relocalization in challenging environments with dual-stream convolutional neural networks, IEEE Trans. Autom. Sci. Eng. (2018)
  • F. Xue et al., Guided feature selection for deep visual odometry
  • R. Li et al., UnDeepVO: monocular visual odometry through unsupervised deep learning
  • A. Kendall et al., PoseNet: a convolutional network for real-time 6-DoF camera relocalization
  • R. Mur-Artal et al., ORB-SLAM2: an open-source SLAM system for monocular, stereo, and RGB-D cameras, IEEE Trans. Robot. (2017)
  • A. Geiger et al., Are we ready for autonomous driving? The KITTI vision benchmark suite
  • D.G. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vis. (2004)
  • M.A. Fischler et al., Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography, Commun. ACM (1981)
  • N. Snavely et al., Modeling the world from Internet photo collections, Int. J. Comput. Vis. (2008)
  • H. Strasdat et al., Scale drift-aware large scale monocular SLAM, Robot. Sci. Syst. (2010)
  • A.J. Davison, Real-time simultaneous localisation and mapping with a single camera
  • R. Mur-Artal et al., ORB-SLAM: a versatile and accurate monocular SLAM system, IEEE Trans. Robot. (2015)
  • M. Kaess et al., Flow separation for fast and robust stereo odometry
  • R. Martins et al., An efficient rotation and translation decoupled initialization from large field of view depth images
  • P. Kim et al., Visual odometry with drift-free rotation estimation using indoor scene regularities
  • B. Guan et al., Visual odometry using a homography formulation with decoupled rotation and translation estimation using minimal solutions
  • K. Konda et al., Learning visual odometry with a convolutional network
  • G. Costante et al., Exploring representation learning with CNNs for frame-to-frame ego-motion estimation, IEEE Robot. Autom. Lett. (2016)
  • B. Ummenhofer et al., DeMoN: depth and motion network for learning monocular stereo

    Weisheng Wang is currently a master degree candidate in the School of Computer Science and Technology at Hangzhou Dianzi University, China. He received the B.S. degree in Computer Science and Technology from Hangzhou Dianzi University in 2018. His research interests include image processing, computational photography and machine learning.

    Wan Luo is currently a master degree candidate in the School of Information and Electronic Engineering at Zhejiang Gongshang University, China. He received the B.S. degree in Electronic Information Engineering from Harbin Far East Institute of Science and Technology, China, in 2019. His research interests include image processing, computational photography and machine learning.

    Lesheng Song received the M.S. degree in Information and Communication Engineering from Zhejiang Gongshang University, China, in 2021. He received the B.S. degree in Electronic Information Engineering from ChuZhou University, China, in 2018. His research interests include image processing, computational photography and machine learning.

    Wenhui Zhou is currently a Professor in the School of Computer Science and Technology at Hangzhou Dianzi University, China. He received his Ph.D in Information and Communication Engineering from Zhejiang University in 2005. From 2005 to 2007, he worked as a postdoctoral researcher in the Department of Information Science and Electronic Engineering at Zhejiang University. From 2015 to 2016, he was also a visiting scholar at Indiana University Bloomington, USA. His research interests include image processing, medical image analysis, computer vision, computational photography, and artificial intelligence.
