PESA-Net: Permutation-Equivariant Split Attention Network for correspondence learning

doi:10.1016/j.inffus.2021.07.018

Information Fusion

Volume 77, January 2022, Pages 81-89

https://doi.org/10.1016/j.inffus.2021.07.018

Highlights

• We propose an iterative network to establish reliable correspondences.
• We develop an attention mechanism to capture rich contextual information.
• We construct a novel block to exploit the global context of unordered data.
• The proposed network achieves state-of-the-art performance.

Abstract

Establishing reliable correspondences with a deep neural network is an important task in computer vision, and it generally requires a permutation-equivariant architecture and rich contextual information. In this paper, we design a Permutation-Equivariant Split Attention Network (PESA-Net) to gather rich contextual information for the feature matching task. Specifically, we propose a novel “Split–Squeeze–Excitation–Union” (SSEU) module. The SSEU module not only generates multiple paths to exploit the geometrical context of putative correspondences from different aspects, but also adaptively captures channel-wise global information by explicitly modeling the interdependencies between feature channels. In addition, we construct a block by fusing the SSEU module, a Multi-Layer Perceptron and some normalizations. The proposed PESA-Net is able to effectively infer the probabilities of correspondences being inliers or outliers and simultaneously recover the relative pose via the essential matrix. Experimental results demonstrate that the proposed PESA-Net surpasses state-of-the-art approaches for pose estimation and outlier rejection on both outdoor and indoor scenes (i.e., YFCC100M and SUN3D). Source code: https://github.com/x-gb/PESA-Net.

Introduction

Feature matching is a fundamental and important problem for a variety of applications in computer vision [1], [2], such as Image Retrieval [3], Image Fusion [4], Image Registration [5] and Structure from Motion (SfM) [6], [7].

Given two images of the same or similar scenes, the aim of feature matching is to establish reliable feature correspondences. Note that matching $N$ feature points to another $N$ feature points may require solving an NP-hard assignment problem. To deal with this complexity, feature matching is typically solved in two steps: first, generating a set of putative correspondences by picking out point pairs with sufficiently similar feature descriptors; second, establishing reliable correspondences from the generated putative ones. In the first step, the putative correspondences are usually obtained with a robust feature extractor, such as the scale-invariant feature transform (SIFT) [8]. However, the brute-force putative correspondences often contain a large number of false matches (i.e., outliers), due to low-quality images and the limitations of local descriptor information. Thus, it is critical to design a robust approach for establishing reliable correspondences in the second step.
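
As an illustration of the first step, the following minimal sketch pairs each descriptor in one image with its nearest neighbour in the other by brute force. It is not taken from the paper: the function name `putative_matches`, the random stand-in descriptors and the noise level are all assumptions made for the example, and real pipelines would use SIFT descriptors and often a ratio or mutual-consistency test.

```python
import numpy as np

def putative_matches(desc_a: np.ndarray, desc_b: np.ndarray) -> np.ndarray:
    """Pair every descriptor in A with its nearest neighbour in B.

    desc_a: (N_a, D) descriptors, desc_b: (N_b, D) descriptors.
    Returns an (N_a, 2) array of index pairs (index in A, index in B).
    """
    # Pairwise squared Euclidean distances, shape (N_a, N_b).
    diff = desc_a[:, None, :] - desc_b[None, :, :]
    dist2 = (diff ** 2).sum(axis=-1)
    nearest = dist2.argmin(axis=1)
    return np.stack([np.arange(len(desc_a)), nearest], axis=1)

# Hypothetical data: descriptors of image A are a shuffled, slightly noisy
# copy of those of image B, so the true assignment is known.
rng = np.random.default_rng(0)
desc_b = rng.normal(size=(5, 8))
true_perm = np.array([3, 0, 4, 1, 2])
desc_a = desc_b[true_perm] + 0.01 * rng.normal(size=(5, 8))

matches = putative_matches(desc_a, desc_b)
```

On clean synthetic data such as this, nearest-neighbour search recovers the true pairing; on real images many of the returned pairs are outliers, which is what the second step has to remove.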

Recently, learning-based methods, e.g., LGC-Net [9], DFE-Net [10], OA-Net [11] and ACNe-Net [12], have been widely proposed for feature matching, owing to the excellent performance of deep neural networks. However, LGC-Net, DFE-Net and OA-Net rely on PointCN, a PointNet-like architecture with Context Normalization, which normalizes the feature maps according to their mean and variance. Context Normalization can therefore be expressed as the solution of a least-squares problem, which is not robust to outliers. To deal with this problem, ACNe-Net captures the context information in both a global and a local manner through a normalization operation. However, this normalization operation neglects channel-wise contextual information of the correspondences, which may lead to sub-optimal performance for feature matching.
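
To make the outlier sensitivity concrete, here is a minimal sketch of a context-normalization step, assuming features are stored as an (N, C) array and normalized per channel across the N correspondences of one image pair. The function name `context_norm`, the `eps` constant and the synthetic data are assumptions for illustration. Because the mean is the least-squares minimizer of the squared deviations, a single gross outlier shifts the statistics for every correspondence.

```python
import numpy as np

def context_norm(feats: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Normalize each channel over the N correspondences of one image pair.

    feats: (N, C) array, one row per putative correspondence.
    """
    mean = feats.mean(axis=0, keepdims=True)
    std = feats.std(axis=0, keepdims=True)
    return (feats - mean) / (std + eps)

# Hypothetical features: 99 well-behaved rows plus one gross outlier.
rng = np.random.default_rng(0)
inliers = rng.normal(loc=0.0, scale=1.0, size=(99, 4))
outlier = np.full((1, 4), 100.0)
feats = np.concatenate([inliers, outlier], axis=0)

normed = context_norm(feats)

# How far the single outlier drags the per-channel mean away from the
# mean of the inliers alone (about 1.0 here, i.e. one inlier std).
mean_shift = feats.mean(axis=0) - inliers.mean(axis=0)
```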

In this paper, we propose a novel attention mechanism called the “Split–Squeeze–Excitation–Union” (SSEU) module, which extracts contextual information in a channel-wise manner to improve matching performance. Compared with other state-of-the-art approaches on YFCC100M unknown scenes [13], our network introduces very few additional parameters and negligible computation while bringing a notable performance gain, as shown in Fig. 1. Specifically, the SSEU module consists of four operations, Split, Squeeze, Excitation and Union, which gather channel-wise global information from different aspects for feature matching. The Split operation generates multiple paths to exploit the geometrical context of putative correspondences from different aspects. The Squeeze operation aggregates feature maps to produce a channel descriptor. The Excitation operation uses a Multi-Layer Perceptron (MLP) to learn a weight for each channel from the channel dependencies, thereby exciting each channel. The Union operation combines and aggregates the geometrical context information from the multiple paths. Note that the SSEU module operates not only in a channel-wise manner, but also in a local and global manner.
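
The four operations can be sketched roughly as follows. This is not the authors' implementation (see their repository for that): the layer choices here, namely one linear map with ReLU per path, mean pooling over correspondences, a two-layer MLP and a softmax across paths, are assumptions borrowed from squeeze-and-excitation and split-attention designs, and all names and shapes (`sseu`, `split_w`, `fc1`, `fc2`) are hypothetical.

```python
import numpy as np

def sseu(x, split_w, fc1, fc2):
    """Illustrative Split-Squeeze-Excitation-Union pass.

    x:       (N, C) features, one row per correspondence.
    split_w: (K, C, C) weights of K parallel per-correspondence linear maps.
    fc1:     (C, H) and fc2: (H, K * C) weights of the excitation MLP.
    Returns an (N, C) array.
    """
    k, c = split_w.shape[0], x.shape[1]
    # Split: K parallel paths, each a shared linear map + ReLU -> (K, N, C).
    paths = np.maximum(np.einsum("nc,kcd->knd", x, split_w), 0.0)
    # Squeeze: fuse the paths and average over correspondences -> (C,).
    descriptor = paths.sum(axis=0).mean(axis=0)
    # Excitation: MLP on the channel descriptor -> logits of shape (K, C).
    hidden = np.maximum(descriptor @ fc1, 0.0)
    logits = (hidden @ fc2).reshape(k, c)
    # Softmax across the K paths, separately for every channel.
    logits = logits - logits.max(axis=0, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=0, keepdims=True)
    # Union: channel-wise weighted sum of the paths -> (N, C).
    return (weights[:, None, :] * paths).sum(axis=0)

# Hypothetical sizes: 7 correspondences, 4 channels, 2 paths, hidden width 2.
rng = np.random.default_rng(0)
x = rng.normal(size=(7, 4))
split_w = rng.normal(size=(2, 4, 4))
fc1 = rng.normal(size=(4, 2))
fc2 = rng.normal(size=(2, 2 * 4))

out = sseu(x, split_w, fc1, fc2)

# Reordering the input rows reorders the output rows in the same way.
perm = np.array([3, 0, 6, 1, 5, 2, 4])
out_perm = sseu(x[perm], split_w, fc1, fc2)
```

In this sketch the only interaction between correspondences is the mean in the Squeeze step, which does not depend on row order, so the channel weights are permutation-invariant and the output is permutation-equivariant.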

To handle the unordered correspondence features, we follow existing learning-based feature matching methods and build the network on a Multi-Layer Perceptron (MLP), which provides permutation equivariance, a property that neither convolutional nor fully-connected layers can offer [12]. Then, we construct a Permutation-Equivariant Split Attention (PESA) block, which fuses the MLP, the SSEU module and some normalizations. After that, by stacking the PESA blocks together, we build our network, called PESA-Net. An overview of PESA-Net is shown in Fig. 2. Note that we add Context Normalization after each MLP to enrich contextual information. In addition, owing to its effective performance, we insert the Geometric Attention Block, which contains a Differentiable Pooling Layer [14], an Order-Aware Filtering Block and a Differentiable Unpooling Layer [11], in the middle of each iterative sub-network to extract the local and global information of correspondences.
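
The permutation-equivariance argument can be checked numerically. The sketch below, with hypothetical sizes and random weights, compares an MLP layer whose weights are shared across correspondences with a fully-connected layer applied to the flattened set: permuting the input rows permutes the output rows only in the shared case.

```python
import numpy as np

rng = np.random.default_rng(0)
n, c = 6, 4                       # hypothetical: 6 correspondences, 4 channels
x = rng.normal(size=(n, c))
perm = np.array([2, 0, 5, 1, 4, 3])

# Shared MLP layer: the same (C, C) weights are applied to every row.
w_shared = rng.normal(size=(c, c))

def shared_mlp(feats):
    return np.maximum(feats @ w_shared, 0.0)

# Fully-connected layer over the flattened set: every output entry
# depends on the position of every input row.
w_full = rng.normal(size=(n * c, n * c))

def fully_connected(feats):
    return (feats.reshape(-1) @ w_full).reshape(n, c)

# Equivariance test: f(P x) == P f(x)
shared_is_equivariant = np.allclose(shared_mlp(x[perm]), shared_mlp(x)[perm])
full_is_equivariant = np.allclose(
    fully_connected(x[perm]), fully_connected(x)[perm]
)
```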

We summarize the contributions as follows:

• We develop a simple and effective attention mechanism, the “Split–Squeeze–Excitation–Union” (SSEU) module, which generates multiple paths and adopts channel-wise dependence to capture rich contextual information from different aspects in a permutation-invariant manner. To the best of our knowledge, we are the first to introduce the split-attention mechanism to feature matching problems.
• We construct a permutation-equivariant block, which consists of the SSEU module, a Multi-Layer Perceptron and some normalizations, to exploit the complex global context of sparse and unordered correspondence data. In addition, we design an iterative permutation-equivariant network for feature matching by stacking the PESA block and the Geometric Attention block together.
• The proposed PESA-Net achieves state-of-the-art performance on relative pose estimation and outlier rejection tasks on two challenging benchmarks, one outdoor and one indoor (i.e., YFCC100M [13] and SUN3D [15]).

The rest of the paper is organized as follows. We first review the related feature matching literature in Section 2. Then, we describe the details of the proposed method in Section 3 and present the experimental results in Section 4. Finally, we draw conclusions in Section 5.

Section snippets

Related work

In this section, we briefly introduce the learning-based feature matching methods most closely related to our paper. In addition, we review some related work on attention mechanisms.

Method

In this section, we design an iterative permutation-equivariant network (called PESA-Net) to handle the outlier rejection and geometry estimation problem. In the following, we introduce the problem formulation in Section 3.1, describe the proposed SSEU module in Section 3.2, and discuss our network architecture in Section 3.3.

Experimental results

In this section, we compare the proposed PESA-Net with the de facto standard handcrafted method (i.e., RANSAC [19]) and several state-of-the-art methods, including Point-Net++ [40], LGC-Net [9], DFE-Net [10], OA-Net++ [11] and ACNe-Net [12], on two tasks. Two publicly available datasets (i.e., the YFCC100M dataset [13] and the SUN3D dataset [15]) are employed in both the camera pose estimation and outlier rejection tasks. In the following, we first introduce the details of the two datasets. After that, we

Conclusion

In this paper, we have designed a novel SSEU module, which is used to build a Permutation-Equivariant Split Attention Network (PESA-Net) for correspondence learning. The proposed SSEU module is able to gather rich contextual information from different aspects in a permutation-invariant manner, by generating multiple paths and adopting channel-wise dependence. In addition, we also construct a permutation-equivariant block by fusing the SSEU module, a Multi-Layer Perceptron and some normalizations,

CRediT authorship contribution statement

Zhen Zhong: Conceptualization, Methodology, Software, Writing – original draft. Guobao Xiao: Methodology, Writing – reviewing and editing, Supervision. Shiping Wang: Writing – reviewing and editing. Leyi Wei: Visualization, Investigation. Xiaoqin Zhang: Writing – reviewing and editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grant 62072223 and by the Natural Science Foundation of Fujian Province under Grant 2020J01131199.

References (42)

Jiang X. et al., A review of multimodal image matching: Methods and applications, Inf. Fusion (2021)
Ahmed K.T. et al., Content based image retrieval using image features information fusion, Inf. Fusion (2019)
Liu Y. et al., Multi-focus image fusion with dense SIFT, Inf. Fusion (2015)
Yuan X. et al., Evolution strategies based image registration via feature matching, Inf. Fusion (2004)
Torr P.H. et al., MLESAC: A new robust estimator with application to estimating image geometry, Comput. Vis. Image Underst. (2000)
Ma J. et al., Image matching from handcrafted to deep features: A survey, Int. J. Comput. Vis. (2021)
Xiao G. et al., Deterministic model fitting by local-neighbor preservation and global-residual optimization, IEEE Trans. Image Process. (2020)
Schonberger J.L. et al., Structure-from-motion revisited
Lowe D.G., Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vis. (2004)
Moo Yi K. et al., Learning to find good correspondences
Ranftl R. et al., Deep fundamental matrix estimation
Zhang J. et al., Learning two-view correspondences and geometry using order-aware network
Sun W. et al., ACNe: Attentive context normalization for robust permutation-equivariant learning
Thomee B. et al., YFCC100M: The new data in multimedia research (2015)
Ying Z. et al., Hierarchical graph representation learning with differentiable pooling
Xiao J. et al., Sun3d: A database of big spaces reconstructed using sfm and object labels
Rublee E. et al., ORB: An efficient alternative to SIFT or SURF
Yi K.M. et al., Lift: Learned invariant feature transform
DeTone D. et al., Superpoint: Self-supervised interest point detection and description
Fischler M.A. et al., Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography, Commun. ACM (1981)
