Self-supervised multi-frame depth estimation with visual-inertial pose transformer and monocular guidance
Information Fusion (IF 18.6), Pub Date: 2024-03-16, DOI: 10.1016/j.inffus.2024.102363
Xiang Wang, Haonan Luo, Zihang Wang, Jin Zheng, Xiao Bai

Self-supervised monocular depth estimation has been a popular topic since it does not require labor-intensive collection of depth ground truth. However, the accuracy of a monocular network is limited because it can only exploit the context of a single image, ignoring the geometric cues residing in videos. Most recently, multi-frame depth networks have been introduced into the self-supervised depth learning framework to improve monocular depth; they explicitly encode geometric information via pairwise cost volume construction. In this paper, we address two main issues that affect cost volume construction and thus multi-frame depth estimation. First, camera pose estimation, which determines the epipolar geometry in cost volume construction but has rarely been addressed, is enhanced with an additional inertial modality. The complementary visual and inertial modalities are fused adaptively by a novel visual-inertial fusion Transformer, in which self-attention performs visual-inertial feature interaction and cross-attention is used for task feature decoding and pose regression, yielding accurate camera poses. Second, the monocular depth prior, which contains contextual information about the scene, is introduced into the multi-frame cost volume aggregation at the feature level. A novel monocular-guided cost volume excitation module is proposed to adaptively modulate cost volume features and resolve possible matching ambiguity. With the proposed modules, we present a self-supervised multi-frame depth estimation network consisting of a monocular depth branch that serves as the prior, a camera pose branch integrating both visual and inertial modalities, and a multi-frame depth branch that produces the final depth with the aid of the former two branches. Experimental results on the KITTI dataset show that our method achieves a notable performance boost on multi-frame depth estimation over state-of-the-art competitors, improving depth accuracy by 9.2% and 5.3% relative to ManyDepth and MOVEDepth, respectively.
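Below is a minimal PyTorch sketch of the two mechanisms outlined in the abstract, intended only as an illustration rather than the authors' implementation: the module names, token shapes, layer counts, and the SE-style channel gating are assumptions.

```python
# Hedged sketch of (1) a visual-inertial fusion Transformer for relative pose
# regression and (2) a monocular-guided cost volume excitation module.
# Assumptions: visual features arrive as flattened image tokens, inertial
# features as encoded IMU tokens; a single learnable pose query is decoded by
# cross-attention; monocular features gate cost volume features channel-wise.
import torch
import torch.nn as nn


class VisualInertialPoseTransformer(nn.Module):
    """Hypothetical visual-inertial fusion Transformer for 6-DoF pose regression."""

    def __init__(self, dim=256, num_heads=8, num_layers=4):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        # learnable query token decoded from the fused tokens via cross-attention
        self.pose_query = nn.Parameter(torch.zeros(1, 1, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.pose_head = nn.Linear(dim, 6)  # 3 translation + 3 axis-angle rotation

    def forward(self, visual_tokens, inertial_tokens):
        # visual_tokens: (B, Nv, dim), inertial_tokens: (B, Ni, dim)
        tokens = torch.cat([visual_tokens, inertial_tokens], dim=1)
        fused = self.encoder(tokens)                          # self-attention fusion
        query = self.pose_query.expand(fused.size(0), -1, -1)
        pose_feat, _ = self.cross_attn(query, fused, fused)   # cross-attention decoding
        return self.pose_head(pose_feat.squeeze(1))           # (B, 6) relative pose


class MonoGuidedCostVolumeExcitation(nn.Module):
    """Hypothetical SE-style excitation: monocular features gate cost volume features."""

    def __init__(self, mono_channels=64, cost_channels=64):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(mono_channels, cost_channels, kernel_size=1),
            nn.Sigmoid())

    def forward(self, cost_volume_feat, mono_feat):
        # cost_volume_feat, mono_feat: (B, C, H, W) at the same resolution
        return cost_volume_feat * self.gate(mono_feat)


if __name__ == "__main__":
    pose_net = VisualInertialPoseTransformer()
    pose = pose_net(torch.randn(2, 192, 256), torch.randn(2, 10, 256))
    print(pose.shape)  # torch.Size([2, 6])

    excite = MonoGuidedCostVolumeExcitation()
    out = excite(torch.randn(2, 64, 48, 160), torch.randn(2, 64, 48, 160))
    print(out.shape)   # torch.Size([2, 64, 48, 160])
```

In this sketch the pose branch fuses image and IMU tokens with self-attention, decodes a single pose query with cross-attention, and regresses a 6-DoF relative pose, while the excitation module modulates the cost volume features with gates computed from the monocular prior.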

Updated: 2024-03-16