Information Sciences, Volume 566, August 2021, Pages 178-194

Adaptive discriminant analysis for semi-supervised feature selection

https://doi.org/10.1016/j.ins.2021.02.035

Abstract

As semi-supervised feature selection is becoming much more popular among researchers, many related methods have been proposed in recent years. However, many of these methods first compute a similarity matrix prior to feature selection, and the matrix is then kept fixed during the subsequent feature selection process. Clearly, a similarity matrix generated from the original dataset is susceptible to noise features. In this paper, we propose a novel adaptive discriminant analysis for semi-supervised feature selection, named SADA. Instead of computing a similarity matrix first, SADA simultaneously learns an adaptive similarity matrix S and a projection matrix W with an iterative process. Moreover, we introduce the $\ell_{2,p}$-norm to control the sparsity of S by adjusting p. Experimental results show that S becomes sparser as p decreases. The experimental results on synthetic datasets and nine benchmark datasets demonstrate the superiority of SADA in comparison with six semi-supervised feature selection methods.

Introduction

Feature selection is very important for high-dimensional data analysis because it can remove irrelevant features with only slight performance deterioration [1]. With the rapid increase of data size, obtaining labeled data is often costly [2]. Therefore, to free us from laborious and tedious data labeling work, only a small set of data samples is expected to be marked with ground truth. At the same time, it is desirable to exploit unlabeled samples during training to ensure the effectiveness of the learned models. Research topics related to this problem, such as image annotation and categorization, have become hot spots in many machine learning fields [3], [4]. It is thus desirable to develop feature selection methods that can exploit both labeled and unlabeled data, and the study of “semi-supervised feature selection” has accordingly gained increasing attention [5], [6], [7].

Due to the advantages of semi-supervised feature selection, related methods have sprung up in recent years. However, many of these methods share the shortcoming of measuring features with a ranking criterion without considering the models [8], [9], [10], [11]. Ren et al. proposed a wrapper-type forward semi-supervised feature selection framework [12] that exploits both labeled and unlabeled data for supervised sequential forward feature selection (SFFS). Xu et al. introduced a discriminative semi-supervised feature selection method based on the idea of manifold regularization, making use of the classification margin and the geometry of the probability distribution to select the features. However, because of its computational complexity of $O(n^{2.5}/\varepsilon)$, where $n$ is the number of objects and $\varepsilon$ is a fairly small stopping criterion, their method is time-consuming [13]. To choose the “best so far” feature subset from streaming features, Wu et al. proposed a novel feature selection method called online streaming feature selection (OSFS) [14]. However, domain knowledge is required in OSFS, and thus Eskandari et al. chose to use rough sets (RS) to optimize the former model (OS-NRRSAR-SA) [15]. Recently, Zhou et al. further improved the OS-NRRSAR-SA model with an adapted neighborhood rough set [16]. Chen et al. have also used rough sets to perform feature selection on imbalanced data [17].

Embedded semi-supervised methods are superior to other feature selection methods in many ways because feature selection becomes part of the model training process. Chen et al. proposed a semi-supervised feature selection method, RLSR [5], in which rescaled linear square regression is introduced to extend least-squares regression for feature selection. Yuan et al. improved RLSR by introducing an $\varepsilon$-dragging technique to enlarge the distances between different classes [18]. In addition to the methods mentioned above, other methods such as semi-supervised feature selection via spline regression [19], ensemble feature selection [20], [21], [22], and parallel feature selection [23], [24] have also been developed.

For most feature selection methods, a pair-wise similarity matrix is constructed from the original data and then kept fixed during the subsequent feature selection process. Researchers have designed several metrics to quantify the similarity between features based on powerful tools such as mutual information and graphs [25], [26]. However, as pointed out in [27], such a similarity matrix may lose the inner class structure of multimodal data, in which the samples of some classes form several separate clusters, and it often misleads feature selection methods into recovering the wrong local structure, since it is easily affected by noise features.

Sparsity commonly exists in real-world data; thus, sparse learning has become a key component of feature selection. Shi et al. considered the superiority of the $\ell_{2,p}$-norm as well as its non-Lipschitz continuity, claimed the effectiveness of the $\ell_{2,1-2}$-norm, and applied CCCP and ADMM to solve the resulting non-convex problem [28]. Zhang et al. drew the conclusion that, for the $\ell_{2,p}$-norm, a smaller $p$ leads to higher performance. They also discussed the situation where $p \to 0$ and proposed two algorithms to optimize the discrete feature selection problem [29].

Recently, Chen et al. proposed the LAP framework for both labeled and unlabeled data [30]. Instead of computing a fixed similarity matrix prior to performing feature selection, LAP learns an adaptive similarity matrix S and a projection matrix W simultaneously with an iterative method. Building on this idea, we extend LAP to semi-supervised feature selection tasks by proposing a semi-supervised adaptive discriminant analysis (SADA). As an extension of LAP, SADA can learn a better similarity matrix by weakening the effect of noise features on the similarity computation, and it can deal with multimodal data [27] by investigating the local structure of the data. The main contributions of our work include (a schematic sketch of the alternating scheme follows the list):

  • 1.

    We rewrite $\|\mathbf{W}^{\top}(\mathbf{x}_i - \mathbf{x}_j)\|_2$ as $\|\mathbf{W}^{\top}(\mathbf{x}_i - \mathbf{x}_j)\|_2^p$ by introducing the $\ell_{2,p}$-norm to control the sparsity of S by adjusting $p$, which can be used to adaptively preserve locality. Experimental results show that S becomes sparser as $p$ decreases.

  • 2.

    We take both labeled and unlabeled data into account to better preserve the locality in the semi-supervised scenario.

  • 3.

    Comprehensive experiments on nine benchmark datasets show the superior performance of the proposed approach in comparison with six semi-supervised feature selection methods.
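Section 4 develops the precise objective; purely to make the alternating scheme above concrete, the following is a minimal numpy sketch. It is our own simplification, not the authors' exact algorithm: `sada_sketch` is a hypothetical name, the label information and the discriminant term are omitted for brevity, and the similarity update is a generic IRLS-style reweighting in which the exponent mimics the role of $p$.

```python
import numpy as np

def sada_sketch(X, p=0.5, k=2, n_iter=10, eps=1e-8):
    """Schematic alternating scheme (our simplification, not the paper's
    exact updates). X is a (d, n) data matrix; returns a projection
    W (d, k) and an adaptive similarity matrix S (n, n)."""
    d, n = X.shape
    W = np.linalg.qr(np.random.randn(d, k))[0]     # random orthonormal init
    for _ in range(n_iter):
        # 1) pairwise distances in the projected space: ||W^T(x_i - x_j)||_2
        Z = W.T @ X                                # (k, n)
        D = np.linalg.norm(Z[:, :, None] - Z[:, None, :], axis=0)
        # 2) IRLS-style similarity: small projected distance -> large weight;
        #    the exponent (p - 2)/2 plays the role of p, and smaller p
        #    concentrates each row of S on fewer neighbours (sparser S)
        S = (D ** 2 + eps) ** ((p - 2) / 2)
        np.fill_diagonal(S, 0)
        S /= S.sum(axis=1, keepdims=True)          # row-normalize
        # 3) update W from the graph Laplacian of S (locality preserving)
        A = (S + S.T) / 2                          # symmetrize learned graph
        L_g = np.diag(A.sum(axis=1)) - A
        _, vecs = np.linalg.eigh(X @ L_g @ X.T)
        W = vecs[:, :k]                            # smallest eigenvectors
    return W, S
```

In this simplified form, $p$ close to 2 spreads each row of S almost uniformly over all samples, whereas a small $p$ concentrates its mass on the nearest neighbours, which matches the sparsity behaviour claimed in contribution 1.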

The rest of this paper is organized as follows. Section 2 presents the notations, and Section 3 surveys the existing semi-supervised feature selection methods. The semi-supervised feature selection method, SADA, is proposed in Section 4. We present the experimental results and analysis in Section 5. The conclusions and directions for future work are provided in Section 6.

Section snippets

Notations and definitions

We now summarize the notation and the definitions of the norms used in this paper. Matrices are written as boldface uppercase letters and vectors as boldface lowercase letters. For a matrix $\mathbf{M} = (m_{ij})$, its $i$-th row is denoted by $\mathbf{m}^i$ and its $j$-th column by $\mathbf{m}_j$. The Frobenius norm of a matrix $\mathbf{M} \in \mathbb{R}^{n \times m}$ is defined as $\|\mathbf{M}\|_F = \sqrt{\sum_{i=1}^{n} \sum_{j=1}^{m} m_{ij}^2}$. The $\ell_{2,p}$-norm of a matrix $\mathbf{M} \in \mathbb{R}^{n \times m}$ is defined as $\|\mathbf{M}\|_{2,p} = \big( \sum_{i=1}^{n} \big( \sum_{j=1}^{m} m_{ij}^2 \big)^{p/2} \big)^{1/p}$.
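As a quick sanity check of these two definitions, here is a minimal numpy sketch (the function names are ours):

```python
import numpy as np

def frobenius_norm(M):
    """Frobenius norm: square root of the sum of squared entries."""
    return np.sqrt((M ** 2).sum())

def l2p_norm(M, p):
    """l_{2,p}-norm: p-norm of the vector of row-wise l2 norms."""
    row_norms = np.sqrt((M ** 2).sum(axis=1))   # ||m^i||_2 for each row i
    return (row_norms ** p).sum() ** (1.0 / p)

M = np.array([[3.0, 4.0], [0.0, 0.0], [1.0, 0.0]])
print(frobenius_norm(M))   # sqrt(9 + 16 + 1) = sqrt(26)
print(l2p_norm(M, 1.0))    # 5 + 0 + 1 = 6, i.e. the l_{2,1}-norm
print(l2p_norm(M, 0.5))    # (5**0.5 + 1**0.5)**2 ~= 10.47
```

Note that $p = 1$ recovers the familiar $\ell_{2,1}$-norm and that an all-zero row contributes nothing, which is why the $\ell_{2,p}$-norm with small $p$ is commonly used to promote row sparsity.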

Semi-supervised feature selection

The early semi-supervised feature selection methods are filter-based and score the features with a ranking criterion regardless of the model [8], [9], [10], [11]. For example, Zhao et al. proposed a semi-supervised feature selection algorithm named sSelect based on spectral analysis [8]. Consider a dataset $\mathbf{X} \in \mathbb{R}^{d \times n}$ consisting of two subsets: a set of $l$ labeled objects $\mathbf{X}_L = (\mathbf{x}_1, \ldots, \mathbf{x}_l)$ that are associated with class labels $\mathbf{y}_L \in \mathbb{R}^{l}$, and a set of $u = n - l$ unlabeled objects $\mathbf{X}_U = (\mathbf{x}_{l+1}, \ldots, \mathbf{x}_{l+u})$ for which the …
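In code, this semi-supervised setup is simply a column split of the data matrix; a minimal illustration under assumed sizes (all names and numbers here are ours, not the paper's):

```python
import numpy as np

# X: (d, n) data matrix whose first l columns carry ground-truth labels
d, n, l = 20, 100, 10
X = np.random.randn(d, n)
y_L = np.random.randint(0, 3, size=l)   # labels for the first l objects
X_L, X_U = X[:, :l], X[:, l:]           # labeled / unlabeled parts, u = n - l
```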

Proposed method

In this section, we propose the new feature selection method, which operates in a semi-supervised manner.

Experiments on synthetic datasets

We generated a synthetic dataset D1 to test the projection ability of the proposed method for feature selection. D1 consists of 12 dimensions, where the data in the first two dimensions are distributed in three Gaussian clusters, while the data in the other dimensions are uniformly distributed noise features. Fig. 2a shows the dataset in the first two dimensions, in which two small Gaussian clusters are buried in one class. In this experiment, our goal is to find a good projection direction that …
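The snippet does not give the exact cluster parameters, so the following generator is only an assumed reconstruction of a D1-like dataset: three 2-D Gaussian clusters, two of which share one class, padded with ten uniform noise dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_per = 100
# first two dimensions: three Gaussian clusters (positions are illustrative)
centers = [(-3, 0), (3, 2), (3, -2)]
informative = np.vstack([
    rng.normal(loc=c, scale=0.5, size=(n_per, 2)) for c in centers
])
# remaining 10 dimensions: uniformly distributed noise features
noise = rng.uniform(-5, 5, size=(3 * n_per, 10))
D1 = np.hstack([informative, noise])     # (300, 12) synthetic dataset
labels = np.repeat([0, 1, 1], n_per)     # two Gaussian clusters in one class
```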

Conclusions

In this paper, we have proposed a novel semi-supervised feature selection method, named SADA, that performs feature selection and implicit adaptive local structure learning simultaneously. The new method simultaneously learns a projection matrix W and an implicit adaptive similarity matrix S from both labeled and unlabeled data. In the new objective function, the $\ell_{2,p}$-norm is imposed on the pair-wise projected distances, and experimental results show that the learned implicit similarity matrix S …

CRediT authorship contribution statement

Weichan Zhong: Methodology, Software, Validation, Formal analysis, Investigation, Data curation, Writing - original draft. Xiaojun Chen: Conceptualization, Methodology, Validation, Investigation, Writing - original draft. Feiping Nie: Conceptualization, Project administration, Writing - original draft. Joshua Zhexue Huang: Project administration, Writing - original draft.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This research was supported by the Major Project of the New Generation of Artificial Intelligence (No. 2018AAA0102900), NSFC under Grant No. 61773268, the Natural Science Foundation of SZU (Grant No. 000346), and the Shenzhen Research Foundation for Basic Research, China (No. JCYJ20180305124149387).

References (32)

  • S.H. Huang, Supervised feature selection: a tutorial, Artif. Intell. Res. (2015)
  • Y. Luo et al., Vector-valued multi-view semi-supervised learning for multi-label image classification
  • C. Tang et al., Adaptive hypergraph embedded semi-supervised multi-label image annotation, IEEE Trans. Multimedia (2019)
  • X. Chen, G. Yuan, F. Nie, J.Z. Huang, Semi-supervised feature selection via rescaled linear regression, in: …
  • J. Li, X. Liang, P. Li, W. Zhang, Q. Du, H. Yuan, Two-dimensional semi-supervised feature selection, in: 2020 10th …
  • Z. Zhao et al., Semi-supervised feature selection via spectral analysis, in: …


    Weichan Zhong is a Master's degree student at the College of Computer Science and Software, Shenzhen University, Shenzhen, China. Her current research interests include clustering and feature selection.

    Xiaojun Chen (M’16) received a Ph.D. degree from the Harbin Institute of Technology, Harbin, China, in 2011. He is currently an Associate Professor of the College of Computer Science and Software, Shenzhen University, Shenzhen, China. His current research interests include subspace clustering, topic model, feature selection, and massive data mining.

    Feiping Nie received a Ph.D. degree in Computer Science from Tsinghua University, China, in 2009. His research interests are machine learning and its applications, such as pattern recognition, data mining, computer vision, image processing, and information retrieval. He has published more than 100 papers in top journals and conferences including TPAMI, IJCV, TIP, TNNLS/TNN, TKDE, TKDD, Bioinformatics, ICML, NIPS, KDD, IJCAI, and AAAI. His papers have been cited more than 5000 times (Google Scholar). He is now serving as an Associate Editor or PC member for several prestigious journals and conferences in the related fields.

    Joshua Zhexue Huang received a Ph.D. degree from the Royal Institute of Technology, Stockholm, Sweden. He is currently a Professor with the College of Computer Science and Software, Shenzhen University, Shenzhen, China, a Professor and a Chief Scientist of the Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Beijing, China, and an Honorary Professor with the Department of Mathematics, The University of Hong Kong, Hong Kong. His current research interests include data mining, machine learning, and clustering algorithms.
