Similarity-based constraint score for feature selection
Introduction
In machine learning and pattern recognition applications, such as data mining and image analysis, datasets are often characterized by a large number of features. The processing of such high-dimensional data requires large memory storage and high computational time, and may lead to poor learning performance [1], [2]. To address these drawbacks, the dimensionality of data is often reduced by selecting relevant features. Typically, feature selection methods can be categorized into three types: filter, wrapper, and embedded methods [2], [3]. Filter methods evaluate features independently of the classification algorithm, while wrapper methods exploit a classification algorithm to evaluate the relevance of features. Embedded methods embed feature selection into the learning algorithm. As filter methods do not depend on any classification scheme, we focus on these methods [2], [3].
According to the availability of prototypes (i.e., labeled data samples that represent classes), feature selection methods can also be divided into unsupervised, supervised, and semi-supervised approaches [1], [2], [3]. Supervised feature selection uses only prototypes to measure the correlation of each feature with the class labels, while unsupervised feature selection analyzes unlabeled data samples to evaluate the capacity of features to preserve the intrinsic data structure [1]. Semi-supervised feature selection exploits both prototypes and unlabeled data samples to evaluate the relevance of features.
In supervised and semi-supervised learning frameworks, besides class labels of prototypes, the available information can be also expressed by must-link and cannot-link constraints. A must-link constraint specifies that two data samples belong to the same class, while a cannot-link constraint specifies that two data samples belong to different classes [4]. Pairwise constraints can be provided by the user or easily generated from a small number of prototypes.
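The generation of pairwise constraints from prototypes can be sketched as follows: every pair of labeled samples yields a must-link constraint when the labels agree and a cannot-link constraint otherwise. This is a minimal illustration; the function name and the dictionary representation of prototypes are our own choices, not notation from the paper.

```python
from itertools import combinations

def generate_constraints(labels):
    """Derive must-link / cannot-link constraints from labeled prototypes.

    `labels` maps a sample index to its class label; every pair of
    prototypes yields exactly one pairwise constraint.
    """
    must_link, cannot_link = [], []
    for (i, yi), (j, yj) in combinations(labels.items(), 2):
        (must_link if yi == yj else cannot_link).append((i, j))
    return must_link, cannot_link

ml, cl = generate_constraints({0: "a", 1: "a", 2: "b"})
# ml == [(0, 1)], cl == [(0, 2), (1, 2)]
```

Note that a small number of prototypes already produces a quadratic number of constraints, which is why constraint-based scores can work with very few labeled samples.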
Must-link and cannot-link constraints are used to estimate the relevance of features via score functions, called constraint scores [1], [2]. These scores introduce the constraints into a similarity matrix that encodes the similarity between data samples. Zhang et al. [5] proposed two supervised constraint scores that use only pairwise constraints to evaluate the relevance of features. Zhao et al. [6] defined a semi-supervised constraint score that analyzes both pairwise constraints and unlabeled data samples for feature selection. Kalakech et al. [1] combined an unsupervised score computed from unlabeled data samples with a supervised score computed from the pairwise constraints; the resulting score is expected to be less sensitive to changes in the constraint set. Two semi-supervised constraint scores that assess the ability of a feature to preserve the local properties of unlabeled data samples while respecting the pairwise constraints have been proposed by Benabdeslem et al. in [7] and [8]. More recently, Yang et al. introduced a new semi-supervised constraint score that takes advantage of the local geometrical structure of unlabeled data samples as well as constraints deduced from prototypes [9], [10].
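As an illustration of such per-feature scores, the first constraint score of Zhang et al. [5] can be sketched as the ratio, for each feature, of squared feature differences over must-link pairs to those over cannot-link pairs, lower values indicating more relevant features. The vectorized implementation below is our own sketch of that formula, not the authors' code.

```python
import numpy as np

def constraint_score_1(X, must_link, cannot_link):
    """Per-feature constraint score in the spirit of Zhang et al. [5]:
    sum of squared feature differences on must-link pairs divided by
    the same sum on cannot-link pairs. Lower is more relevant."""
    ml = np.array(must_link)
    cl = np.array(cannot_link)
    num = ((X[ml[:, 0]] - X[ml[:, 1]]) ** 2).sum(axis=0)
    den = ((X[cl[:, 0]] - X[cl[:, 1]]) ** 2).sum(axis=0)
    return num / den

# Toy data: feature 0 separates the two classes, feature 1 does not.
X = np.array([[0.0, 0.5], [0.1, 0.4], [1.0, 0.3], [0.9, 0.6]])
scores = constraint_score_1(X, [(0, 1), (2, 3)], [(0, 2), (1, 3)])
# scores[0] < scores[1]: feature 0 is ranked as more relevant
```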
The above-mentioned constraint scores belong to the filter approach and evaluate features one by one [2]. The score of a feature subset is then estimated as the sum of the individual feature scores, so the evaluation of a feature subspace ignores the correlation between features. Thus, learning algorithms that operate in a subspace of individually relevant features do not necessarily perform well [8]. In addition, the constraint scores proposed in the literature are based on the analysis of a similarity matrix. Because this similarity matrix is computed in the original feature space, state-of-the-art feature scores can also be corrupted by the curse of dimensionality.
In this paper, we propose a new constraint score that evaluates the relevance of features in the context of both supervised and semi-supervised learning. Our score assesses the ability of a feature subset to respect the available set of pairwise constraints. Unlike existing constraint scores, which evaluate the relevance of each feature individually, our score evaluates a subset of several features simultaneously. The proposed score is then used as the criterion of a sequential forward selection scheme to identify the most relevant subset of features [11].
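The sequential forward selection loop can be sketched as follows. The subset criterion used here (ratio of must-link to cannot-link squared Euclidean distances computed in the candidate subspace) is a hypothetical stand-in for illustration only; the paper's actual score is defined in Section 4.

```python
import numpy as np

def subset_score(X, subset, must_link, cannot_link):
    """Hypothetical subset criterion (illustration only): ratio of
    must-link to cannot-link squared Euclidean distances, computed in
    the candidate feature subspace. Lower is better."""
    Z = X[:, subset]
    ml = sum(((Z[i] - Z[j]) ** 2).sum() for i, j in must_link)
    cl = sum(((Z[i] - Z[j]) ** 2).sum() for i, j in cannot_link)
    return ml / cl

def forward_selection(X, must_link, cannot_link, k):
    """Greedy sequential forward selection: at each step, add the
    feature whose inclusion yields the best (lowest) subset score."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < k:
        best = min(remaining,
                   key=lambda f: subset_score(X, selected + [f],
                                              must_link, cannot_link))
        selected.append(best)
        remaining.remove(best)
    return selected
```

Because the criterion is recomputed on the whole candidate subspace at each step, a feature that is redundant with an already-selected one brings little improvement and tends not to be chosen, which is precisely what per-feature scores cannot capture.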
The performance of the constraint scores is measured by the classification accuracy on test data, commonly obtained with the nearest neighbor classifier. Previous studies use the entire training dataset with its true class labels as prototypes for the classifier, while only a few prototypes are analyzed by the constraint scores. To reproduce conditions closer to real-life applications, in this paper we propose to use only the available information. In the supervised context, only the prototypes involved in pairwise constraint generation are used by the classifier. In the semi-supervised context, we follow the strategy proposed by Kalakech et al. [12], which first applies the constrained K-means algorithm [4] to classify the unlabeled training samples and then uses these classified samples as prototypes to classify the test data. Instead of constrained K-means, we use constrained spectral clustering, which is based on the same similarity matrix concept as the constraint scores.
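The constrained K-means step of this evaluation protocol can be sketched as below: a minimal COP-KMeans in the spirit of Wagstaff et al. [4], where each sample is assigned to the nearest centroid that violates no constraint given the assignments made so far. This is an illustrative simplification (the original algorithm fails when no consistent assignment exists; here we fall back to the nearest centroid), not the constrained spectral clustering actually used in the paper.

```python
import numpy as np

def violates(i, c, assign, must_link, cannot_link):
    """True if assigning sample i to cluster c breaks a constraint,
    given the partial assignment `assign` (-1 = not yet assigned)."""
    for a, b in must_link:
        if i in (a, b) and assign[b if a == i else a] not in (-1, c):
            return True
    for a, b in cannot_link:
        if i in (a, b) and assign[b if a == i else a] == c:
            return True
    return False

def cop_kmeans(X, k, must_link, cannot_link, n_iter=20, seed=0):
    """Minimal COP-KMeans sketch: k-means whose assignment step skips
    clusters that would violate a pairwise constraint."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)].astype(float)
    assign = np.full(len(X), -1)
    for _ in range(n_iter):
        assign = np.full(len(X), -1)
        for i in range(len(X)):
            order = np.argsort(((centroids - X[i]) ** 2).sum(axis=1))
            assign[i] = next((c for c in order
                              if not violates(i, c, assign,
                                              must_link, cannot_link)),
                             order[0])  # fallback; COP-KMeans would fail here
        for c in range(k):
            if (assign == c).any():
                centroids[c] = X[assign == c].mean(axis=0)
    return assign
```

In the protocol of [12], the cluster labels returned for the unlabeled training samples then serve as prototypes for the nearest neighbor classification of the test data.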
The remainder of this paper is organized as follows. Section 2 provides brief definitions related to spectral graph theory and pairwise constraint generation. Section 3 reviews the state of the art of constraint scores. Our proposed constraint score and the feature selection procedure are presented in Section 4. Experimental results on benchmark databases for both supervised and semi-supervised feature selection are provided and discussed in Section 5.
Section snippets
Preliminaries
Constraint scores are based on the concepts of spectral graph theory and pairwise constraints. In this section, we briefly give some notations and definitions related to these two concepts.
Constraint scores
The performance achieved by learning algorithms, such as classification or clustering, depends on a similarity measure, which is usually based on the Euclidean distance in the original feature space. Because these features are not always relevant, many authors select the most relevant ones by means of constraint scores that combine the concepts of spectral graph theory and pairwise constraints.
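The similarity matrix underlying these scores is typically a Gaussian (heat) kernel of the pairwise Euclidean distances. A minimal sketch, assuming the common form S_ij = exp(-||x_i - x_j||^2 / sigma) (some works divide by 2*sigma^2 instead; the exact convention is stated in Section 5):

```python
import numpy as np

def similarity_matrix(X, sigma=1.0):
    """Gaussian (heat-kernel) similarity between all pairs of samples:
    S_ij = exp(-||x_i - x_j||^2 / sigma)."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / sigma)
```

The matrix is symmetric with unit diagonal, and its entries decay with distance, so it directly encodes the local structure that constraint scores seek to preserve.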
Proposed constrained feature selection
Existing constraint scores estimate the relevance of each feature independently of the others. Because these scores do not take the correlation between features into account, we propose a new constraint score that estimates the relevance of a subset of features at once.
Experiments on benchmark databases
We evaluate and compare our proposed constraint score with several constraint scores and several well-known feature selection methods on datasets originating from benchmark databases. We first examine the supervised feature scores, and then assess the performance attained by the semi-supervised feature scores. The data are scaled, and the scaling parameter used to compute the similarity matrices is set accordingly. Feature selection…
Conclusion and future work
In this paper, we presented a new constraint score for feature selection in the context of both supervised and semi-supervised learning. This score evaluates a subset of features at once, whereas state-of-the-art constraint scores evaluate only one feature at a time. This makes it possible to identify redundant features and to avoid the problem of correlation between features. Because our score evaluates the similarity between data samples in the examined feature subspace, selected features can be…
CRediT authorship contribution statement
Abderezak Salmi: Conceptualization, Methodology, Software, Validation, Writing - original draft, Writing - review & editing. Kamal Hammouche: Conceptualization, Methodology, Validation, Writing - original draft, Writing - review & editing, Supervision. Ludovic Macaire: Conceptualization, Methodology, Validation, Writing - original draft, Writing - review & editing, Supervision.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References (23)
- et al., Constraint scores for semi-supervised feature selection: A comparative study, Pattern Recognit. Lett. (2011)
- et al., A survey on semi-supervised feature selection methods, Pattern Recognit. (2017)
- et al., Feature selection in machine learning: A new perspective, Neurocomputing (2018)
- et al., Constraint score: A new filter method for feature selection with pairwise constraints, Pattern Recognit. (2008)
- et al., Locality sensitive semi-supervised feature selection, Neurocomputing (2008)
- K. Wagstaff, C. Cardie, S. Rogers, S. Schrödl, et al., Constrained k-means clustering with background knowledge, in:...
- et al., Constrained laplacian score for semi-supervised feature selection
- et al., Efficient semi-supervised feature selection: Constraint, relevance, and redundancy, IEEE Trans. Knowl. Data Eng. (2014)
- et al., Semi-supervised feature selection for audio classification based on constraint compensated Laplacian score, EURASIP J. Audio Speech Music Process. (2016)
- et al., Semi-supervised minimum redundancy maximum relevance feature selection for audio classification, Multimedia Tools Appl. (2018)
- On automatic feature selection, Int. J. Pattern Recognit. Artif. Intell.
Cited by (4)
- Class-specific feature selection via maximal dynamic correlation change and minimal redundancy, Expert Systems with Applications (2023)
- Iterative constraint score based on hypothesis margin for semi-supervised feature selection, Knowledge-Based Systems (2023)
- A novel feature selection method via mining Markov blanket, Applied Intelligence (2023)
- 3-3FS: ensemble method for semi-supervised multi-label feature selection, Knowledge and Information Systems (2021)