HRDFuse: Monocular 360° Depth Estimation by Collaboratively Learning Holistic-with-Regional Depth Distributions
arXiv - CS - Computer Vision and Pattern Recognition. Pub Date: 2023-03-21, arXiv:2303.11616
Hao Ai, Zidong Cao, Yan-Pei Cao, Ying Shan, Lin Wang

Depth estimation from a monocular 360° image is a burgeoning problem owing to its holistic sensing of a scene. Recently, some methods, e.g., OmniFusion, have applied the tangent projection (TP) to represent a 360° image and predicted depth values via patch-wise regression, with the patches merged into a depth map in equirectangular projection (ERP) format. However, these methods suffer from 1) a non-trivial process of merging numerous patches; 2) capturing less holistic-with-regional contextual information by directly regressing the depth value of each pixel. In this paper, we propose a novel framework, HRDFuse, that subtly combines the potential of convolutional neural networks (CNNs) and transformers by collaboratively learning the holistic contextual information from the ERP and the regional structural information from the TP. First, we propose a spatial feature alignment (SFA) module that learns feature similarities between the TP and ERP to aggregate the TP features into a complete ERP feature map in a pixel-wise manner. Second, we propose a collaborative depth distribution classification (CDDC) module that learns holistic-with-regional histograms capturing the ERP and TP depth distributions, so that the final depth values can be predicted as a linear combination of histogram bin centers. Lastly, we adaptively combine the depth predictions from the ERP and TP to obtain the final depth map. Extensive experiments show that our method produces smoother and more accurate depth results while performing favorably against the SOTA methods.
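Two ideas in the abstract are concrete enough to sketch: predicting depth as a linear combination of histogram bin centers (the CDDC output), and adaptively blending the ERP- and TP-branch depth maps. The Python/NumPy sketch below is a minimal illustration under assumed shapes; the function names (depth_from_bins, fuse_predictions) and the per-pixel confidence weighting are hypothetical, not the authors' implementation.

import numpy as np

def depth_from_bins(logits, bin_centers):
    """Depth as a probability-weighted sum of histogram bin centers.

    logits:      (H, W, K) per-pixel scores over K depth bins (assumed shape)
    bin_centers: (K,) depth value, e.g. in meters, at each bin center
    returns:     (H, W) depth map
    """
    # Softmax over the bin axis turns scores into a per-pixel depth distribution.
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs = e / e.sum(axis=-1, keepdims=True)
    # Final depth = expectation of that distribution over the bin centers.
    return (probs * bin_centers).sum(axis=-1)

def fuse_predictions(depth_erp, depth_tp, weight):
    """Blend ERP- and TP-branch depth maps with a per-pixel weight in [0, 1].

    The paper combines the two branches adaptively; a learned confidence map
    is one plausible realization, assumed here for illustration.
    """
    return weight * depth_erp + (1.0 - weight) * depth_tp

# Toy usage: a 4x8 "image" with 16 depth bins spanning 0.5-10 m.
rng = np.random.default_rng(0)
centers = np.linspace(0.5, 10.0, 16)
d_erp = depth_from_bins(rng.normal(size=(4, 8, 16)), centers)
d_tp = depth_from_bins(rng.normal(size=(4, 8, 16)), centers)
w = rng.uniform(size=(4, 8))
print(fuse_predictions(d_erp, d_tp, w).shape)  # (4, 8)

Predicting a distribution over bin centers rather than regressing a scalar lets each pixel express multi-modal depth hypotheses, which is what allows the holistic (ERP) and regional (TP) histograms to be combined before the final expectation is taken.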

Updated: 2023-03-22