Elsevier

Information Fusion

Volume 77, January 2022, Pages 81-89

Full length article
PESA-Net: Permutation-Equivariant Split Attention Network for correspondence learning

https://doi.org/10.1016/j.inffus.2021.07.018

Highlights

  • We propose an iterative network to establish reliable correspondences.

  • We develop an attention mechanism to capture rich contextual information.

  • We construct a novel block to exploit the global context of unordered data.

  • The proposed network achieves state-of-the-art performance.

Abstract

Establishing reliable correspondences with a deep neural network is an important task in computer vision, and it generally requires a permutation-equivariant architecture and rich contextual information. In this paper, we design a Permutation-Equivariant Split Attention Network (PESA-Net) to gather rich contextual information for the feature matching task. Specifically, we propose a novel “Split–Squeeze–Excitation–Union” (SSEU) module. The SSEU module not only generates multiple paths to exploit the geometrical context of putative correspondences from different aspects, but also adaptively captures channel-wise global information by explicitly modeling the interdependencies between feature channels. In addition, we construct a block by fusing the SSEU module, a Multi-Layer Perceptron and some normalizations. The proposed PESA-Net is able to effectively infer the probability of each correspondence being an inlier or an outlier and simultaneously recover the relative pose via the essential matrix. Experimental results demonstrate that the proposed PESA-Net surpasses state-of-the-art approaches for pose estimation and outlier rejection on both outdoor and indoor scenes (i.e., YFCC100M and SUN3D). Source code: https://github.com/x-gb/PESA-Net.

Introduction

Feature matching is a fundamental and important problem for a variety of applications in computer vision [1], [2], such as Image Retrieval [3], Image Fusion [4], Image Registration [5] and Structure from Motion (SfM) [6], [7].

Given two images of the same or similar scenes, the aim of feature matching is to establish reliable feature correspondences. Note that matching N feature points to another N feature points may require solving an NP-hard assignment problem. To deal with this complexity, feature matching is typically solved in two steps: first, a set of putative correspondences is generated by picking out point pairs with sufficiently similar feature descriptors; second, reliable correspondences are established from the putative ones. For the first step, the putative correspondences are usually obtained with a robust feature extractor, such as the scale-invariant feature transform (SIFT) [8]. However, putative correspondences obtained by brute-force matching often contain a large number of false matches (i.e., outliers), due to low-quality images and the limited information carried by local descriptors. Thus, it is critical to design a robust approach for establishing reliable correspondences in the second step.
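As an illustration of the first step only, the following minimal sketch generates putative correspondences by nearest-neighbour search over descriptors. It is not taken from the paper: the function name `putative_matches`, the Euclidean distance and the `max_dist` threshold are assumptions chosen for illustration, and the descriptors are assumed to be given as NumPy arrays.

```python
import numpy as np


def putative_matches(desc_a, desc_b, max_dist=np.inf):
    """Match each descriptor in desc_a to its nearest neighbour in desc_b.

    desc_a: (Na, D) array, desc_b: (Nb, D) array (hypothetical inputs).
    Returns an (M, 2) integer array of index pairs (i, j), keeping only
    pairs whose Euclidean distance is at most max_dist.
    """
    # Pairwise squared Euclidean distances, shape (Na, Nb).
    d2 = (
        np.sum(desc_a ** 2, axis=1)[:, None]
        - 2.0 * desc_a @ desc_b.T
        + np.sum(desc_b ** 2, axis=1)[None, :]
    )
    d2 = np.maximum(d2, 0.0)  # guard against tiny negative round-off

    nearest = np.argmin(d2, axis=1)  # best candidate in B for each point in A
    dist = np.sqrt(d2[np.arange(desc_a.shape[0]), nearest])
    keep = dist <= max_dist  # "sufficiently similar" descriptors only
    return np.stack([np.nonzero(keep)[0], nearest[keep]], axis=1)
```

Such a matcher looks only at local descriptor similarity, so repeated structures or poor image quality readily produce false matches; the second step is what removes them.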

Recently, learning-based methods, e.g., LGC-Net [9], DFE-Net [10], OA-Net [11] and ACNe-Net [12], have been widely proposed for feature matching, owing to the excellent performance of deep neural networks. However, LGC-Net, DFE-Net and OA-Net rely on PointCN, a PointNet-like architecture with Context Normalization, which normalizes the feature maps according to their mean and variance. Context Normalization can therefore be expressed as the solution of a least-squares problem, which is not robust to outliers. To deal with this problem, ACNe-Net captures the context information in both a global and a local manner through a normalization operation. However, that normalization operation neglects channel-wise contextual information of the correspondences, which may lead to sub-optimal performance for feature matching.
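To make the normalization concrete, here is a minimal NumPy sketch of Context Normalization as described above: each channel is normalized by its mean and variance over all correspondences of one image pair. The function name `context_norm`, the (N, C) feature layout and the `eps` constant are assumptions for illustration, not the authors' code.

```python
import numpy as np


def context_norm(features, eps=1e-5):
    """Normalise each channel across the N correspondences of one image pair.

    features: (N, C) array, one row per putative correspondence
    (layout assumed for this sketch).
    """
    mean = features.mean(axis=0, keepdims=True)  # per-channel mean over the set
    var = features.var(axis=0, keepdims=True)    # per-channel variance over the set
    return (features - mean) / np.sqrt(var + eps)
```

Because the statistics are computed over the whole set, reordering the correspondences only reorders the output rows. The mean is also the least-squares estimate of the channel value, so a single gross outlier shifts it, which is the sensitivity the paragraph refers to.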

In this paper, we propose a novel attention mechanism, the “Split–Squeeze–Excitation–Union” (SSEU) module, which extracts contextual information in a channel-wise manner to improve matching performance. Compared with other state-of-the-art approaches on the YFCC100M unknown scenes [13], our network introduces very few additional parameters and negligible computation while bringing a notable performance gain, as shown in Fig. 1. Specifically, the SSEU module consists of four operations, Split, Squeeze, Excitation and Union, which gather channel-wise global information from different aspects for feature matching. The Split operation generates multiple paths to exploit the geometrical context of putative correspondences from different aspects. The Squeeze operation aggregates the feature maps to produce a channel descriptor. The Excitation operation models the channel dependencies with a Multi-Layer Perceptron (MLP) to produce the excitation of each channel. The Union operation combines and aggregates the geometrical context information from the multiple paths. Note that the SSEU module operates not only in a channel-wise manner, but also in a local and a global manner.
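The four operations can be sketched roughly as follows. This excerpt does not give the exact layer definitions, so the sketch only follows the general pattern of squeeze-and-excitation and split-attention designs. The number of paths, the point-wise linear maps, the average pooling over correspondences, the softmax across paths, and all names (`sseu`, `split_w`, `fc1`, `fc2`) are assumptions, not the authors' implementation.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sseu(features, split_w, fc1, fc2):
    """Hypothetical Split-Squeeze-Excitation-Union sketch.

    features: (N, C) correspondence features.
    split_w:  (K, C, C) one point-wise linear map per path (assumed).
    fc1:      (C, H) and fc2: (H, K * C), the excitation MLP (assumed).
    """
    k, c, _ = split_w.shape

    # Split: K paths, each a linear map shared by all correspondences, then ReLU.
    paths = np.maximum(np.einsum("nc,kcd->knd", features, split_w), 0.0)  # (K, N, C)

    # Squeeze: fuse the paths and average over correspondences -> (C,) descriptor.
    descriptor = paths.sum(axis=0).mean(axis=0)

    # Excitation: a small MLP yields one weight per path and channel;
    # the softmax makes the K weights of each channel sum to one.
    hidden = np.maximum(descriptor @ fc1, 0.0)
    attention = softmax((hidden @ fc2).reshape(k, c), axis=0)  # (K, C)

    # Union: channel-wise weighted sum of the paths.
    return np.einsum("kc,knc->nc", attention, paths)  # (N, C)
```

In this sketch the attention weights depend on the set only through an average over correspondences, which is permutation invariant, while every other step acts on each correspondence independently, so the module as a whole stays permutation equivariant.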

To handle the unordered correspondence features, we follow existing learning-based feature matching methods and build the network on a Multi-Layer Perceptron (MLP), which provides permutation equivariance, a property that neither convolutional nor fully-connected layers offer [12]. Then, we construct a Permutation-Equivariant Split Attention (PESA) block by fusing the MLP, the SSEU module and some normalizations. After that, by stacking the PESA blocks together, we build our network, called PESA-Net. We show an overview of PESA-Net in Fig. 2. Note that we add Context Normalization after each MLP to enrich the contextual information. In addition, because of its effectiveness, we also insert the Geometric Attention Block, which contains a Differentiable Pooling Layer [14], an Order-Aware Filtering Block and a Differentiable Unpooling Layer [11], in the middle of each iterative sub-network to extract the local and global information of correspondences.
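The permutation-equivariance argument can be checked numerically. In the sketch below, `shared_mlp` applies the same weights to every correspondence, whereas `flat_fc` is a fully-connected layer over the flattened set. Both function names and the single-layer form are assumptions made for illustration.

```python
import numpy as np


def shared_mlp(features, weight, bias):
    """Apply one linear layer + ReLU with the same weights to every correspondence.

    features: (N, C_in), weight: (C_in, C_out), bias: (C_out,).
    """
    return np.maximum(features @ weight + bias, 0.0)


def flat_fc(features, weight):
    """Fully-connected layer over the flattened set, for contrast.

    features: (N, C_in), weight: (N * C_in, N * C_out).
    The weights are tied to input positions, so the output depends on the
    order of the correspondences.
    """
    n = features.shape[0]
    return (features.reshape(-1) @ weight).reshape(n, -1)
```

Permuting the rows of the input permutes the rows of the `shared_mlp` output in the same way. For `flat_fc` with generic weights, the same permutation yields an unrelated output.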

We summarize the contributions as follows:

  • We develop a simple and effective attention mechanism, named the “Split–Squeeze–Excitation–Union” (SSEU) module, which generates multiple paths and adopts channel-wise dependence to capture rich contextual information from different aspects in a permutation-invariant manner. To the best of our knowledge, we are the first to introduce the split-attention mechanism to the feature matching problem.

  • We construct a permutation-equivariant block, which consists of the SSEU module, Multi-Layer Perceptron and some normalizations, to exploit the complex global context of sparse and unordered correspondence data. In addition, we also design an iterative permutation-equivariant network by stacking the PESA block and Geometric Attention block together, for feature matching.

  • The proposed PESA-Net achieves state-of-the-art performance on relative pose estimation and outlier rejection on two challenging benchmarks, one outdoor and one indoor (i.e., YFCC100M [13] and SUN3D [15]).

The rest of the paper is organized as follows: We first review the related feature matching literature in Section 2. Then, we describe the details of the proposed method in Section 3 and present the experimental results in Section 4. Finally, we draw conclusions in Section 5.

Section snippets

Related work

In this section, we briefly introduce the learning-based feature matching methods most closely related to our paper. We also review some related work on attention mechanisms.

Method

In this section, we design an iterative permutation-equivariant network (called PESA-Net) to handle the outlier rejection and geometry estimation problem. In the following, we introduce the problem formulation in Section 3.1, describe the proposed SSEU module in Section 3.2, and discuss our network architecture in Section 3.3.
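The problem formulation of Section 3.1 is not included in this excerpt. As general background, networks in this family commonly turn predicted per-correspondence weights into an essential matrix with a weighted eight-point algorithm. The sketch below illustrates that generic technique under stated assumptions (normalized image coordinates, non-negative weights, the hypothetical name `weighted_eight_point`); it is not presented as the paper's exact procedure.

```python
import numpy as np


def weighted_eight_point(x1, x2, weights):
    """Estimate an essential matrix E (up to scale and sign).

    x1, x2:  (N, 2) normalised image coordinates of the correspondences.
    weights: (N,) non-negative weights, e.g. predicted inlier scores (assumed).
    Minimises sum_i w_i * (p2_i^T E p1_i)^2 subject to ||E||_F = 1, then
    projects the result onto the set of essential matrices.
    """
    n = x1.shape[0]
    ones = np.ones((n, 1))
    p1 = np.hstack([x1, ones])  # homogeneous points in image 1
    p2 = np.hstack([x2, ones])  # homogeneous points in image 2

    # Row i holds the products p2_i[a] * p1_i[b], so that
    # row_i @ E.reshape(-1) equals p2_i^T E p1_i.
    a = (p2[:, :, None] * p1[:, None, :]).reshape(n, 9)

    # Weighted normal matrix; its eigenvector with the smallest eigenvalue
    # is the weighted least-squares solution.
    m = a.T @ (weights[:, None] * a)
    _, eigvecs = np.linalg.eigh(m)  # eigenvalues in ascending order
    e = eigvecs[:, 0].reshape(3, 3)

    # Enforce two equal singular values and one zero singular value.
    u, s, vt = np.linalg.svd(e)
    sigma = 0.5 * (s[0] + s[1])
    return u @ np.diag([sigma, sigma, 0.0]) @ vt
```

Every step is differentiable with respect to the weights, which is why this kind of solver can sit at the end of a network that classifies correspondences as inliers or outliers. Correspondences with zero weight do not influence the estimate at all.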

Experimental results

In this section, we compare the proposed PESA-Net with the de facto standard handcrafted method (i.e., RANSAC [19]) and several state-of-the-art methods, including PointNet++ [40], LGC-Net [9], DFE-Net [10], OA-Net++ [11] and ACNe-Net [12], on two tasks. Two publicly available datasets (i.e., the YFCC100M dataset [13] and the SUN3D dataset [15]) are employed for both the camera pose estimation and outlier rejection tasks. In the following, we first introduce the details of the two datasets. After that, we

Conclusion

In this paper, we have designed a novel SSEU module, which is used to build a Permutation-Equivariant Split Attention Network (PESA-Net) for correspondence learning. The proposed SSEU module is able to gather rich contextual information from different aspects in a permutation invariant manner, by generating multiple paths and adopting channel-wise dependence. In addition, we also construct a permutation-equivariant block by fusing the SSEU module, Multi-Layer Perceptron and some normalizations,

CRediT authorship contribution statement

Zhen Zhong: Conceptualization, Methodology, Software, Writing – original draft. Guobao Xiao: Methodology, Writing – reviewing and editing, Supervision. Shiping Wang: Writing – reviewing and editing. Leyi Wei: Visualization, Investigation. Xiaoqin Zhang: Writing – reviewing and editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grant 62072223 and by the Natural Science Foundation of Fujian Province under Grant 2020J01131199.

References (42)

  • Yi K.M. et al., Learning to find good correspondences

  • Ranftl R. et al., Deep fundamental matrix estimation

  • Zhang J. et al., Learning two-view correspondences and geometry using order-aware network

  • Sun W. et al., ACNe: Attentive context normalization for robust permutation-equivariant learning

  • Thomee B. et al., YFCC100M: The new data in multimedia research (2015)

  • Ying Z. et al., Hierarchical graph representation learning with differentiable pooling

  • Xiao J. et al., SUN3D: A database of big spaces reconstructed using SfM and object labels

  • Rublee E. et al., ORB: An efficient alternative to SIFT or SURF

  • Yi K.M. et al., LIFT: Learned invariant feature transform

  • DeTone D. et al., SuperPoint: Self-supervised interest point detection and description

  • Fischler M.A. et al., Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography, Commun. ACM (1981)

Cited by (13)

    • JRA-Net: Joint representation attention network for correspondence learning

      2023, Pattern Recognition
      Citation Excerpt :

      OANet [12] relies on the differential pooling and unpooling operation to extract local context and proposes the Order-Aware Filtering Block to capture global context. PESA-Net [26] adds grouped channel attention blocks after each MLP to collect the information on the channels. T-Net [27] proposes a T-shaped network structure to integrate the information learned in different stages of the network.

    • Shape-Former: Bridging CNN and Transformer via ShapeConv for multimodal image matching

      2023, Information Fusion
      Citation Excerpt :

      Furthermore, most of them may sacrifice considerable inliers to estimate the motion parameters, thus limiting their application scenarios [29]. Most recently, Zhong et al. [56] proposed a permutation-equivariant split attention network (PESA-Net) to improve the ability of existing PointNet-like framework in context information learning, by an attention module including split, squeeze, excitation and union operation. Specifically, PESA-Net leverages the concepts of attention mechanism in CNN [57], and further introduces multiple learning paths, thus gathering the channel-wise contextual information and bringing performance gain in two-view correspondences.

    • BSCA-Net: Bit Slicing Context Attention network for polyp segmentation

      2022, Pattern Recognition
      Citation Excerpt :

      For example, RMA [14] utilizes an attention mechanism to handle the image classification issue. PESA-Net [33] employs the permutation-equivariant split attention mechanism to correspondence learning. Reverse attention [10] proposes a novel attention to guide side-output residual learning for the salient object detection.
