PixelHop: A successive subspace learning (SSL) method for object recognition

https://doi.org/10.1016/j.jvcir.2019.102749

Abstract

A new machine learning methodology, called successive subspace learning (SSL), is introduced in this work. SSL contains four key ingredients: (1) successive near-to-far neighborhood expansion; (2) unsupervised dimension reduction via subspace approximation; (3) supervised dimension reduction via label-assisted regression (LAG); and (4) feature concatenation and decision making. An image-based object classification method, called PixelHop, is proposed to illustrate the SSL design. Experimental results show that the PixelHop method outperforms classic CNN models of similar model complexity on three benchmark datasets (MNIST, Fashion MNIST and CIFAR-10). Although SSL and deep learning (DL) share some high-level concepts, they are fundamentally different in model formulation, training process and training complexity. An extensive comparison of SSL and DL is provided to offer further insights into the potential of SSL.

Introduction

Subspace methods have been widely used in signal/image processing, pattern recognition, computer vision, etc. [1], [2], [3], [4], [5], [6]. They carry different names and emphases in various contexts, such as manifold learning [7], [8]. Generally speaking, a subspace denotes either the feature space of a certain object class (e.g., the subspace of the dog object class) or the dominant feature space obtained by dropping less important features (e.g., the subspace obtained via principal component analysis, or PCA). The subspace representation offers a powerful tool for signal analysis, modeling and processing. Subspace learning aims to find subspace models for concise data representation and accurate decision making based on training samples.

Most existing subspace methods are conducted in a single stage. One may ask whether there is an advantage to performing subspace learning in multiple stages. Research on generalizing from one-stage to multi-stage subspace learning is rare. PCAnet [9] cascades two PCA stages and provides an empirical solution to multi-stage subspace learning. The scarcity of research on this topic may be attributed to the fact that a straightforward cascade of linear multi-stage subspace methods, which can be expressed as the product of a sequence of matrices, is equivalent to a linear one-stage subspace method. From this viewpoint, the advantage of linear multi-stage subspace methods may not be obvious.
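
As a minimal illustration of this point, the following NumPy sketch (with arbitrary, hypothetical matrix sizes) shows that cascading two linear projection stages collapses into a single one-stage projection by the product matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((32, 64))   # stage 1: project a 64-D input to 32-D
W2 = rng.standard_normal((16, 32))   # stage 2: project 32-D to 16-D
x = rng.standard_normal(64)

# Applying the two stages in cascade ...
y_two_stage = W2 @ (W1 @ x)
# ... is identical to a single one-stage projection with W = W2 @ W1.
y_one_stage = (W2 @ W1) @ x
assert np.allclose(y_two_stage, y_one_stage)
```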

Yet, multi-stage subspace learning may be worthwhile under the following two conditions. First, the input subspace is not fixed but grows from one stage to the next. For example, we can take the union of a pixel and its eight nearest neighbors to form an input space in the first stage. Afterward, we enlarge the neighborhood of the center pixel from 3×3 to 5×5 in the second stage. Clearly, the first input space is a proper subset of the second input space. Generalizing this to multiple stages gives rise to a "successive subspace growing" process. This process exists naturally in the convolutional neural network (CNN) architecture, where the response in a deeper layer has a larger receptive field; in other words, it corresponds to an input of a larger neighborhood. Instead of analyzing these embedded spaces independently, it is advantageous, in terms of computational and storage efficiency, to derive the representation of a larger neighborhood from the representations of its constituent smaller neighborhoods. Second, special attention should be paid to the cascade interface of two consecutive stages, as elaborated below.

When two consecutive CNN layers are in cascade, a nonlinear activation unit is used to rectify the outputs of the convolutional operations of the first layer before they are fed to the second layer. The importance of nonlinear activation to CNN performance has been empirically verified, yet little research has been conducted on understanding its actual role. Along this line, it was pointed out in [10] that a sign confusion problem arises when two CNN layers are in cascade. To address this problem, Kuo et al. proposed the Saak (subspace approximation via augmented kernels) transform [11] and the Saab (subspace approximation via adjusted bias) transform [12] as alternatives to nonlinear activation. Both Saak and Saab transforms are variants of PCA that are carefully designed to avoid sign confusion.
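
The exact Saab construction is given in [12]; the sketch below is only a simplified, illustrative rendition of its main idea (one constant DC kernel, PCA-derived AC kernels, and a bias that keeps responses non-negative so that no sign information is confused across stages). The bias selection rule and the function names here are assumptions for illustration, not the published algorithm.

```python
import numpy as np

def saab_fit(patches, num_ac_kernels):
    """Fit a simplified one-stage Saab-like transform on flattened patches.

    patches: array of shape (num_samples, patch_dim).
    Returns the DC kernel, the AC kernels and a non-negativity bias.
    """
    n, d = patches.shape
    dc_kernel = np.ones(d) / np.sqrt(d)               # constant DC kernel
    dc = patches @ dc_kernel                           # DC responses
    residual = patches - np.outer(dc, dc_kernel)       # remove the DC component
    # AC kernels: leading principal components of the DC-removed residuals.
    _, _, vt = np.linalg.svd(residual, full_matrices=False)
    ac_kernels = vt[:num_ac_kernels]
    # Bias: shift AC responses so that every output is non-negative
    # (a simplification of the adjusted-bias rule of the Saab transform).
    bias = max(0.0, -(residual @ ac_kernels.T).min())
    return dc_kernel, ac_kernels, bias

def saab_apply(patches, dc_kernel, ac_kernels, bias):
    dc = patches @ dc_kernel
    residual = patches - np.outer(dc, dc_kernel)
    ac = residual @ ac_kernels.T + bias
    return np.concatenate([dc[:, None], ac], axis=1)   # shape (n, 1 + num_ac_kernels)
```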

One advantage of adopting Saak/Saab transforms rather than nonlinear activation is that the resulting CNN system is easier to explain [12]. Specifically, Kuo et al. [12] proposed the use of multi-stage Saab transforms to determine the parameters of convolutional layers and the use of multi-stage linear least-squares (LLS) regression to determine the parameters of fully-connected (FC) layers. Since all parameters of the CNN are determined in a feedforward manner without any backpropagation (BP), this design is named the "feedforward design". The feedforward design is drastically different from the BP-based design. Retrospectively, the work in [12] offered the first successive subspace learning (SSL) design example, although the SSL term was not explicitly introduced therein. Although inspired by the deep learning (DL) framework, SSL is fundamentally different in its model formulation, training process and training complexity. We will conduct an in-depth comparison between DL and SSL in Section 4.

SSL can be applied to, but is not limited to, the parameter design of a CNN. In this work, we examine the feedforward design as well as SSL from a higher ground. Our current study is a sequel to the cumulative research efforts presented in [10], [11], [12], [13]. Here, we introduce SSL formally and discuss its similarities to and differences from DL. To illustrate the flexibility and generalizability of SSL, we present an SSL-based machine learning system for object classification, called the PixelHop method. The block diagram of the PixelHop method deviates from the standard CNN architecture completely since it is no longer a network. The word "hop" is borrowed from graph theory. For a target node in a graph, its immediate neighboring nodes connected by an edge are called its one-hop neighbors; the nodes connected to it through n consecutive edges via the shortest path are its n-hop neighbors. The PixelHop method begins with a very localized region, namely a single pixel denoted by p, which is called the 0-hop input. We concatenate the attributes of a pixel and the attributes of its one-hop neighbors to form a one-hop neighborhood denoted by R1(p). We can keep enlarging the input by including larger neighborhood regions. This idea applies to structured data (e.g., images) as well as unstructured data (e.g., 3D point cloud sets). An SSL-based 3D point cloud classification scheme, called the PointHop method, was proposed in [14].
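
As a concrete illustration, the sketch below gathers, for every interior pixel of an image, the attributes of the pixel and its eight one-hop neighbors into a single vector, i.e., one plausible way to form R1(p) on a regular pixel grid. The window size, the absence of padding and the stride of one are illustrative choices, not the exact settings used in the experiments.

```python
import numpy as np

def one_hop_neighborhood(feat, window=3):
    """Concatenate each pixel's attributes with those of its one-hop neighbors.

    feat: array of shape (H, W, C) holding C attributes per pixel.
    Returns an array of shape (H - window + 1, W - window + 1, window * window * C).
    """
    h, w, c = feat.shape
    out = np.empty((h - window + 1, w - window + 1, window * window * c))
    for i in range(h - window + 1):
        for j in range(w - window + 1):
            out[i, j] = feat[i:i + window, j:j + window, :].ravel()
    return out

# Example: a 32x32 RGB image; one hop yields 3 * 3 * 3 = 27 attributes per pixel.
img = np.random.rand(32, 32, 3)
r1 = one_hop_neighborhood(img)      # shape (30, 30, 27)
```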

If we implement the above idea in a straightforward manner, the dimension of the neighborhood R_i(p), where i = 1, 2, ..., I is the stage index, grows very fast as i increases. To control the rapid dimension growth of R_i(p), we use the Saab transform to reduce its dimension. Since no label is used in the Saab transform, it is an unsupervised dimension reduction technique. To further reduce the dimension of the Saab responses at each stage, we exploit the labels of training samples to perform supervised dimension reduction, which is implemented by a label-assisted regression (LAG) unit. As a whole, the PixelHop method provides an extremely rich feature set by integrating attributes from near-to-far neighborhoods of selected spatial locations. Finally, we adopt an ensemble method to combine features and train a classifier, such as a support vector machine (SVM) [15] or a random forest (RF) [16], to provide the ultimate classification result. Extensive experiments are conducted on three datasets (namely, the MNIST, Fashion MNIST and CIFAR-10 datasets) to evaluate the performance of the PixelHop method. Experimental results show that the PixelHop method outperforms classical CNNs of similar model complexity in classification accuracy while demanding much lower training complexity.
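
The LAG unit is detailed in Section 2.3; the sketch below shows one plausible realization consistent with the description above: each class is grouped into a few clusters, every training sample receives a soft pseudo-label vector that decays with its distance to the centroids of its own class, and a linear least-squares regression maps the input features to these soft targets. The cluster count, the distance-to-probability mapping and the use of scikit-learn's KMeans are assumptions made for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_lag(features, labels, num_classes, clusters_per_class=5, alpha=10.0):
    """Fit a label-assisted regression (LAG) unit (illustrative sketch)."""
    n, d = features.shape
    target_dim = num_classes * clusters_per_class
    targets = np.zeros((n, target_dim))
    for c in range(num_classes):
        idx = np.where(labels == c)[0]
        km = KMeans(n_clusters=clusters_per_class, n_init=10).fit(features[idx])
        # Soft pseudo labels: closer centroids of the sample's own class get
        # larger probabilities; entries for the other classes stay zero.
        dist = np.linalg.norm(
            features[idx, None, :] - km.cluster_centers_[None, :, :], axis=2)
        soft = np.exp(-alpha * dist / dist.mean())
        soft /= soft.sum(axis=1, keepdims=True)
        targets[idx, c * clusters_per_class:(c + 1) * clusters_per_class] = soft
    # Linear least-squares regression (with bias) from features to soft targets.
    A = np.hstack([features, np.ones((n, 1))])
    W, *_ = np.linalg.lstsq(A, targets, rcond=None)
    return W

def apply_lag(features, W):
    A = np.hstack([features, np.ones((features.shape[0], 1))])
    return A @ W          # supervised, reduced-dimension features
```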

Our current work has three major contributions. First, we introduce the SSL notion explicitly and make a thorough comparison between SSL and DL. Second, the LAG unit using soft pseudo labels, as presented in Section 2.3, is novel. Third, we use the PixelHop method as an illustrative example of SSL and conduct extensive experiments to demonstrate its performance.

The rest of this paper is organized as follows. The PixelHop method is presented in Section 2. Experimental results of the PixelHop method are given in Section 3. Comparison between DL and SSL is discussed in Section 4. Finally, concluding remarks are drawn and future research topics are pointed out in Section 5.


PixelHop method

We present the PixelHop method to illustrate the SSL methodology for image-based object classification in this section. First, we give an overview of the whole system in Section 2.1. Then, we study the properties of Saab filters that reside in each PixelHop unit in Section 2.2. Finally, we examine the label-assisted regression (LAG) unit of the PixelHop method in Section 2.3.

Experimental results

We organize the experimental results in this section as follows. First, we discuss the experimental setup in Section 3.1. Second, we conduct an ablation study and examine the effects of different parameters on the Fashion MNIST dataset in Section 3.2. Third, we perform error analysis, compare the performance of color image classification using different color spaces, and show the scalability of the PixelHop method in Section 3.3. Finally, we conduct performance benchmarking between the PixelHop method and CNN models of comparable complexity in Section 3.4.

Discussion

In this section, we first summarize the key ingredients of SSL in Section 4.1. Then, extensive discussion on the comparison of SSL and DL is made in Section 4.2 to provide further insights into the potential of SSL.

Conclusion and future work

A successive subspace learning (SSL) methodology was introduced and the PixelHop method was proposed in this work. In contrast with traditional subspace methods, SSL examines the near- and far-neighborhoods of a set of selected pixels. It uses the training data to learn three sets of parameters: (1) Saab filters for unsupervised dimension reduction in the PixelHop unit, (2) regression matrices for supervised dimension reduction in the LAG unit, and (3) parameters required by the classifier.

CRediT authorship contribution statement

Yueru Chen: Conceptualization, Methodology, Software, Writing - original draft. C.-C. Jay Kuo: Methodology, Writing - review & editing, Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This research is supported in part by DARPA and Air Force Research Laboratory (AFRL) under agreement number FA8750-16-2-0173 and in part by the U.S. Army Research Laboratory's External Collaboration Initiative (ECI) of the Director's Research Initiative (DRIA) program. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation hereon. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of DARPA, AFRL or the U.S. Government.



This paper has been recommended for acceptance by Zicheng Liu.
