VIMI: Vehicle-Infrastructure Multi-view Intermediate Fusion for Camera-based 3D Object Detection
arXiv - CS - Computer Vision and Pattern Recognition. Pub Date: 2023-03-20, DOI: arxiv-2303.10975
Zhe Wang, Siqi Fan, Xiaoliang Huo, Tongda Xu, Yan Wang, Jingjing Liu, Yilun Chen, Ya-Qin Zhang

In autonomous driving, Vehicle-Infrastructure Cooperative 3D Object Detection (VIC3D) makes use of multi-view cameras from both vehicles and traffic infrastructure, providing a global vantage point with rich semantic context of road conditions beyond a single vehicle viewpoint. Two major challenges prevail in VIC3D: 1) inherent calibration noise when fusing multi-view images, caused by time asynchrony across cameras; 2) information loss when projecting 2D features into 3D space. To address these issues, we propose a novel 3D object detection framework, Vehicles-Infrastructure Multi-view Intermediate fusion (VIMI). First, to fully exploit the holistic perspectives from both vehicles and infrastructure, we propose a Multi-scale Cross Attention (MCA) module that fuses infrastructure and vehicle features on selective multi-scales to correct the calibration noise introduced by camera asynchrony. Then, we design a Camera-aware Channel Masking (CCM) module that uses camera parameters as priors to augment the fused features. We further introduce a Feature Compression (FC) module with channel and spatial compression blocks to reduce the size of transmitted features for enhanced efficiency. Experiments show that VIMI achieves 15.61% overall AP_3D and 21.44% AP_BEV on the new VIC3D dataset, DAIR-V2X-C, significantly outperforming state-of-the-art early fusion and late fusion methods with comparable transmission cost.
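The abstract describes an intermediate-fusion pipeline of three modules: FC compresses infrastructure features before transmission, MCA fuses the two views with multi-scale cross attention, and CCM re-weights the fused features using camera parameters as priors. The sketch below is a minimal, hypothetical PyTorch illustration of that data flow; the module names follow the abstract, but all tensor shapes, channel sizes, and internal designs (pooling-based multi-scale attention, an MLP mask over flattened camera parameters) are assumptions of this illustration, not the authors' implementation.

```python
# Hypothetical sketch of a VIMI-style intermediate-fusion pipeline (not the authors' code).
# Shapes, channel sizes, and module internals are illustrative assumptions.
import torch
import torch.nn as nn


class FeatureCompression(nn.Module):
    """FC: compress infrastructure features (channels + spatial) before transmission."""
    def __init__(self, in_ch=256, out_ch=64):
        super().__init__()
        self.channel = nn.Conv2d(in_ch, out_ch, kernel_size=1)                        # channel compression
        self.spatial = nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=2, padding=1)  # spatial compression

    def forward(self, x):
        return self.spatial(self.channel(x))


class MultiScaleCrossAttention(nn.Module):
    """MCA: fuse vehicle (query) and infrastructure (key/value) features at several scales."""
    def __init__(self, dim=64, num_heads=4, scales=(1, 2)):
        super().__init__()
        self.scales = scales
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True) for _ in scales
        )
        self.out = nn.Conv2d(dim * len(scales), dim, kernel_size=1)

    def forward(self, veh, infra):
        b, c, h, w = veh.shape
        fused = []
        for s, attn in zip(self.scales, self.attn):
            q = nn.functional.adaptive_avg_pool2d(veh, (h // s, w // s))
            kv = nn.functional.adaptive_avg_pool2d(infra, (h // s, w // s))
            q_seq = q.flatten(2).transpose(1, 2)    # (B, HW, C)
            kv_seq = kv.flatten(2).transpose(1, 2)
            out, _ = attn(q_seq, kv_seq, kv_seq)    # vehicle queries attend to infrastructure
            out = out.transpose(1, 2).reshape(b, c, h // s, w // s)
            fused.append(nn.functional.interpolate(
                out, size=(h, w), mode="bilinear", align_corners=False))
        return self.out(torch.cat(fused, dim=1)) + veh  # residual keeps the ego view


class CameraAwareChannelMasking(nn.Module):
    """CCM: re-weight fused feature channels with a mask predicted from camera parameters."""
    def __init__(self, dim=64, cam_param_dim=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(cam_param_dim, dim), nn.ReLU(), nn.Linear(dim, dim), nn.Sigmoid()
        )

    def forward(self, feat, cam_params):
        mask = self.mlp(cam_params)[:, :, None, None]  # (B, C, 1, 1) channel mask
        return feat * mask


if __name__ == "__main__":
    veh_feat = torch.randn(1, 64, 32, 32)    # vehicle-side feature map (assumed shape)
    infra_raw = torch.randn(1, 256, 64, 64)  # infrastructure-side feature before transmission
    cam_params = torch.randn(1, 16)          # flattened intrinsics/extrinsics (assumed encoding)

    infra_feat = FeatureCompression()(infra_raw)              # -> (1, 64, 32, 32), sent over the air
    fused = MultiScaleCrossAttention()(veh_feat, infra_feat)  # cross-view multi-scale fusion
    fused = CameraAwareChannelMasking()(fused, cam_params)    # camera-prior channel masking
    print(fused.shape)                                        # torch.Size([1, 64, 32, 32])
```

In this reading, compression happens on the infrastructure side before transmission and the vehicle feature serves as the attention query so the fused output stays aligned with the ego view; the paper's actual architecture may differ in both respects.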

Updated: 2023-03-21