Full length article
PESA-Net: Permutation-Equivariant Split Attention Network for correspondence learning
Introduction
Feature matching is a fundamental problem for a variety of applications in computer vision [1], [2], such as image retrieval [3], image fusion [4], image registration [5] and Structure from Motion (SfM) [6], [7].
Given two images of the same or similar scenes, the aim of feature matching is to establish reliable feature correspondences. Note that matching one set of feature points to another may require solving an NP-hard assignment problem. To deal with this complexity, feature matching is typically solved in two steps: first, generating a set of putative correspondences by picking out point pairs with sufficiently similar feature descriptors; second, establishing reliable correspondences from the generated putative ones. For the first step, the putative correspondences are usually built from features extracted by a robust extractor, such as the scale-invariant feature transform (SIFT) [8]. However, the brute-force putative correspondences often contain a large number of false matches (i.e., outliers), due to low-quality images and the limitation of relying on local descriptor information alone. Thus, it is critical to design a robust approach for establishing reliable correspondences in the second step.
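As an illustration of the first step, the following minimal sketch pairs each descriptor in one image with its nearest neighbour in the other and keeps the pair only if the two descriptors are sufficiently similar. The function names, the toy 2-D descriptors and the `max_dist` threshold are assumptions made for this example; they are not taken from the paper or from any specific library.

```python
import math

def l2(a, b):
    """Euclidean distance between two equal-length descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def putative_matches(desc_a, desc_b, max_dist):
    """Pair each descriptor in image A with its nearest neighbour in image B.

    A pair is kept only if its descriptor distance is at most max_dist
    (a hypothetical similarity threshold chosen for this sketch).
    """
    matches = []
    for i, da in enumerate(desc_a):
        dists = [l2(da, db) for db in desc_b]
        j = min(range(len(desc_b)), key=dists.__getitem__)
        if dists[j] <= max_dist:
            matches.append((i, j, dists[j]))
    return matches

# Toy 2-D "descriptors"; real SIFT descriptors are 128-dimensional.
desc_a = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
desc_b = [[0.1, 0.0], [1.0, 1.2], [9.0, 9.0]]
matches = putative_matches(desc_a, desc_b, max_dist=0.5)
```

Here the third descriptor of image A has no sufficiently similar counterpart and is dropped. On real images many of the surviving pairs are still wrong, which is why the second step (outlier rejection) is needed.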
Recently, learning-based methods, e.g., LGC-Net [9], DFE-Net [10], OA-Net [11] and ACNe-Net [12], have been widely proposed for feature matching, owing to the excellent performance of deep neural networks. However, LGC-Net, DFE-Net and OA-Net rely on PointCN, a PointNet-like architecture with Context Normalization, which normalizes the feature maps according to their mean and variance. Context Normalization can therefore be expressed as the solution of a least-squares problem, which is not robust to outliers. To deal with this problem, ACNe-Net was proposed to capture the context information in both a global and a local manner through a normalization operation. However, this normalization operation neglects the channel-wise contextual information of correspondences, which may lead to sub-optimal performance for feature matching.
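To make the outlier sensitivity concrete, the sketch below normalizes each feature channel over all correspondences by its mean and variance, which is the operation described above. The function name, the `eps` constant and the toy values are assumptions for illustration only, not the implementation used in the cited networks.

```python
import math

def context_norm(features, eps=1e-5):
    """Normalize each channel over the N correspondences (rows).

    features: list of N rows, each a list of C channel values.
    eps: small constant (assumed here) to avoid division by zero.
    """
    n, c = len(features), len(features[0])
    means = [sum(row[k] for row in features) / n for k in range(c)]
    variances = [sum((row[k] - means[k]) ** 2 for row in features) / n
                 for k in range(c)]
    return [[(row[k] - means[k]) / math.sqrt(variances[k] + eps)
             for k in range(c)]
            for row in features]

# Four inlier-like values near 1.0 and a single gross outlier at 50.0.
feats = [[1.0], [1.1], [0.9], [1.0], [50.0]]
normed = context_norm(feats)
mean_with_outlier = sum(row[0] for row in feats) / len(feats)
```

The mean is the least-squares estimate of the channel value, so the single outlier drags it from about 1.0 to 10.8. After normalization all four inliers end up on the same side of zero, with the outlier dominating the statistics.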
In this paper, we propose a novel attention mechanism called the “Split–Squeeze–Excitation–Union” (SSEU) module, which extracts contextual information in a channel-wise manner to improve matching performance. Compared with other state-of-the-art approaches on the YFCC100M unknown scenes [13], our network introduces very few additional parameters and negligible computation while bringing a notable performance gain, as shown in Fig. 1. Specifically, the SSEU module consists of four operations, namely Split, Squeeze, Excitation and Union, which gather channel-wise global information from different aspects for feature matching. The Split operation generates multiple paths to exploit the geometrical context of putative correspondences from different aspects. The Squeeze operation aggregates feature maps to produce a channel descriptor. The Excitation operation exploits channel dependence, learned by a Multi-Layer Perceptron (MLP), to produce the excitation of each channel. The Union operation combines and aggregates the geometrical context information from the multiple paths. Note that the SSEU module operates not only in a channel-wise manner, but also in both a local and a global manner.
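The four operations can be sketched roughly as follows. This is only one plausible reading of the description above, not the authors' design: it assumes that Split slices the channels into groups, that Squeeze averages each channel over all correspondences, that Excitation is a two-layer MLP ending in a sigmoid gate, and that Union concatenates the rescaled paths. All names, sizes and the random weights are hypothetical.

```python
import math
import random

def squeeze(path):
    """Average each channel over all N correspondences (one value per channel)."""
    n = len(path)
    return [sum(row[k] for row in path) / n for k in range(len(path[0]))]

def linear(vec, weights):
    """Bias-free dense layer; weights holds one list of input weights per output."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def excitation(descriptor, w1, w2):
    """Two-layer MLP (ReLU, then sigmoid) giving a gate in (0, 1) per channel."""
    hidden = [max(0.0, h) for h in linear(descriptor, w1)]
    return [1.0 / (1.0 + math.exp(-z)) for z in linear(hidden, w2)]

def sseu(features, groups, params):
    """features: N rows of C channels; C is assumed divisible by groups."""
    n, c = len(features), len(features[0])
    size = c // groups
    out = [[] for _ in range(n)]
    for g in range(groups):
        # Split: path g works on its own slice of the channels.
        path = [row[g * size:(g + 1) * size] for row in features]
        # Squeeze + Excitation: one gate per channel of this path.
        gate = excitation(squeeze(path), *params[g])
        # Union: rescale the path and concatenate it back into the output.
        for i, row in enumerate(path):
            out[i].extend(x * s for x, s in zip(row, gate))
    return out

def random_params(groups, size, hidden, rng):
    """Random (untrained) MLP weights for each path, for demonstration only."""
    def make(rows, cols):
        return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]
    return [(make(hidden, size), make(size, hidden)) for _ in range(groups)]

rng = random.Random(0)
feats = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(5)]
params = random_params(groups=2, size=2, hidden=3, rng=rng)
out = sseu(feats, groups=2, params=params)
```

Because the squeeze step averages over the correspondences, the gates do not depend on the order of the input rows, so reordering the correspondences simply reorders the output rows in the same way.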
To handle the unordered correspondence features, we follow existing learning-based feature matching methods and build the network on a Multi-Layer Perceptron (MLP), which provides permutation equivariance, a property that neither convolutional nor fully-connected layers can offer [12]. Then, we construct a Permutation-Equivariant Split Attention (PESA) block, which fuses the MLP, the SSEU module, and several normalization layers. After that, by stacking the PESA blocks together, we build our network, called PESA-Net. We show an overview of PESA-Net in Fig. 2. Note that we add Context Normalization after each MLP to enrich contextual information. In addition, we insert the Geometric Attention Block, which contains a Differentiable Pooling Layer [14], an Order-Aware Filtering Block, and a Differentiable Unpooling Layer [11], in the middle of each iterative sub-network to extract the local and global information of correspondences, owing to its effectiveness.
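The permutation-equivariance property can be checked with a small experiment. The sketch below applies the same perceptron layer to every correspondence independently and compares it with a dense layer that mixes all correspondences after flattening. The 4-D input per correspondence, the layer sizes and the random weights are assumptions made for this illustration.

```python
import random

def shared_mlp(points, weights, bias):
    """Apply the same linear layer + ReLU to every correspondence independently."""
    return [[max(0.0, sum(w * x for w, x in zip(row_w, p)) + b)
             for row_w, b in zip(weights, bias)]
            for p in points]

def fully_connected(points, weights):
    """Flatten all correspondences into one vector and mix them with a dense layer."""
    flat = [x for p in points for x in p]
    return [sum(w * x for w, x in zip(row_w, flat)) for row_w in weights]

rng = random.Random(0)
# Four putative correspondences, each assumed to be a 4-D vector (x1, y1, x2, y2).
points = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(4)]
perm = [2, 0, 3, 1]
shuffled = [points[i] for i in perm]

# Shared per-correspondence layer: 4 inputs -> 3 outputs.
w = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
b = [rng.uniform(-1, 1) for _ in range(3)]
out = shared_mlp(points, w, b)
out_shuffled = shared_mlp(shuffled, w, b)
# Equivariance: permuting the input rows permutes the output rows identically.
equivariant = out_shuffled == [out[i] for i in perm]

# Dense layer over the flattened 16-D input: 16 inputs -> 16 outputs.
w_fc = [[rng.uniform(-1, 1) for _ in range(16)] for _ in range(16)]
fc_out = fully_connected(points, w_fc)
fc_shuffled = fully_connected(shuffled, w_fc)
# Reshape to 4 rows and permute them, which is what equivariance would require.
fc_rows = [fc_out[4 * i:4 * i + 4] for i in range(4)]
fc_expected = [x for i in perm for x in fc_rows[i]]
fc_equivariant = fc_shuffled == fc_expected
```

With the shared layer, `equivariant` holds for any permutation, because each output row depends only on its own input row. The dense layer ties each weight to a fixed input position, so with generic weights its output is not simply a reordering of the original output.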
We summarize the contributions as follows:
- We develop a simple and effective attention mechanism, named the “Split–Squeeze–Excitation–Union” (SSEU) module, which generates multiple paths and adopts channel-wise dependence to capture rich contextual information from different aspects in a permutation-invariant manner. To the best of our knowledge, we are the first to introduce the split-attention mechanism to feature matching problems.
- We construct a permutation-equivariant block, which consists of the SSEU module, a Multi-Layer Perceptron and several normalization layers, to exploit the complex global context of sparse and unordered correspondence data. In addition, we design an iterative permutation-equivariant network for feature matching by stacking the PESA block and the Geometric Attention block together.
- The proposed PESA-Net achieves state-of-the-art performance on relative pose estimation and outlier rejection tasks on two challenging benchmarks, one outdoor and one indoor (i.e., YFCC100M [13] and SUN3D [15]).
The rest of the paper is organized as follows. We first review the related feature matching literature in Section 2. Then, we describe the details of the proposed method in Section 3 and present the experimental results in Section 4. Finally, we draw conclusions in Section 5.
Section snippets
Related work
In this section, we briefly introduce the learning-based feature matching methods most closely related to our paper. In addition, we review some related work on attention mechanisms.
Method
In this section, we design an iterative permutation-equivariant network (called PESA-Net) to handle the outlier rejection and geometry estimation problems. In the following, we introduce the problem formulation in Section 3.1, describe the proposed SSEU module in Section 3.2, and discuss our network architecture in Section 3.3.
Experimental results
In this section, we compare the proposed PESA-Net with the de facto standard handcrafted method (i.e., RANSAC [19]) and several state-of-the-art methods, including Point-Net++ [40], LGC-Net [9], DFE-Net [10], OA-Net++ [11] and ACNe-Net [12], on two tasks. Two publicly available datasets (i.e., the YFCC100M dataset [13] and the SUN3D dataset [15]) are employed in both the camera pose estimation and outlier rejection tasks. In the following, we first introduce the details of the two datasets. After that, we
Conclusion
In this paper, we have designed a novel SSEU module, which is used to build a Permutation-Equivariant Split Attention Network (PESA-Net) for correspondence learning. The proposed SSEU module is able to gather rich contextual information from different aspects in a permutation invariant manner, by generating multiple paths and adopting channel-wise dependence. In addition, we also construct a permutation-equivariant block by fusing the SSEU module, Multi-Layer Perceptron and some normalizations,
CRediT authorship contribution statement
Zhen Zhong: Conceptualization, Methodology, Software, Writing – original draft. Guobao Xiao: Methodology, Writing – reviewing and editing, Supervision. Shiping Wang: Writing – reviewing and editing. Leyi Wei: Visualization, Investigation. Xiaoqin Zhang: Writing – reviewing and editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was supported by the National Natural Science Foundation of China under Grant 62072223, and by the Natural Science Foundation of Fujian Province under Grant 2020J01131199.
References (42)
- et al. A review of multimodal image matching: Methods and applications. Inf. Fusion (2021)
- et al. Content based image retrieval using image features information fusion. Inf. Fusion (2019)
- et al. Multi-focus image fusion with dense SIFT. Inf. Fusion (2015)
- et al. Evolution strategies based image registration via feature matching. Inf. Fusion (2004)
- et al. MLESAC: A new robust estimator with application to estimating image geometry. Comput. Vis. Image Underst. (2000)
- et al. Image matching from handcrafted to deep features: A survey. Int. J. Comput. Vis. (2021)
- et al. Deterministic model fitting by local-neighbor preservation and global-residual optimization. IEEE Trans. Image Process. (2020)
- et al. Structure-from-motion revisited
- Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. (2004)
- Learning to find good correspondences
- Deep fundamental matrix estimation
- Learning two-view correspondences and geometry using order-aware network
- ACNe: Attentive context normalization for robust permutation-equivariant learning
- YFCC100M: The new data in multimedia research
- Hierarchical graph representation learning with differentiable pooling
- Sun3d: A database of big spaces reconstructed using sfm and object labels
- ORB: An efficient alternative to SIFT or SURF
- Lift: Learned invariant feature transform
- Superpoint: Self-supervised interest point detection and description
- Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM