Abstract

To address the low optimization accuracy of the cuckoo search algorithm, a new search algorithm, the Elite Hybrid Binary Cuckoo Search (EHBCS) algorithm, is proposed by introducing feature weighting and an elite strategy. The EHBCS algorithm is designed for feature selection on a series of binary classification datasets, including low-dimensional and high-dimensional samples, using an SVM classifier. The experimental results show that the EHBCS algorithm achieves better classification performance compared with the binary genetic algorithm and binary particle swarm optimization algorithm. Besides, we explain its superiority in terms of standard deviation, sensitivity, specificity, precision, and F-measure.

1. Introduction

Feature selection attempts to find the most discriminative subset of features to bring reasonable recognition rates for some classifiers. Given a problem with d features, there are 2^d possible feature subsets, making an exhaustive search impracticable for high-dimensional feature spaces. In addition, high-dimensional data also contain a large number of irrelevant and noise-polluted features, and there is often information redundancy between features. These factors degrade the performance of learning algorithms and significantly increase their computational complexity. Therefore, feature selection has become a research hotspot.

As a key technology of pattern recognition and machine learning, feature selection is an effective method for dealing with high-dimensional data. Feature selection models can be divided into three categories [1]: filter [2], embedding [3], and wrapper [4]. Filter methods define the relevant features without prior classification of the data. Embedding methods embed the feature selection algorithm into the classification algorithm, performing feature selection and training at the same time. Wrapper methods, on the other hand, incorporate classification algorithms to search for and select relevant features. Wrapper methods generally outperform filter methods in terms of classification accuracy [5]. Recent studies have shown that feature selection can better solve many practical problems, including classification and medical problems [6–9].

Another vital part of the feature selection process is the search strategy: selecting the feature subset that meets the optimal evaluation criterion, which is usually a combinatorial optimization problem. In recent years, metaheuristic algorithms based on biological behavior and physical systems in nature have been proposed to solve optimization problems [10]. Metaheuristic optimization algorithms, also known as nature-inspired algorithms, study the evolutionary behavior of species and translate it into computer science algorithms, including the genetic algorithm [11], particle swarm optimization algorithm [12], bat algorithm [13, 14], and cuckoo algorithm [15]. Metaheuristic optimization algorithms have achieved good results in feature selection. For example, Liu et al. [16] combined the genetic algorithm and simulated annealing algorithm to select feature subsets; their experimental results showed that the hybrid algorithm has high reliability and strong convergence. In contrast, Siedlecki and Sklansky [17] combined the genetic algorithm with feature selection and achieved some improvement, but their work exposed the premature convergence problem of the genetic algorithm. Kennedy and Eberhart [18] proposed the binary particle swarm optimization algorithm (BPSO), which modifies the traditional particle swarm optimization algorithm to solve binary optimization problems. Besides, Firpi and Goodman [19] applied BPSO to feature selection problems.

The success of metaheuristic methods lies in the efficiency of their search strategies and their ability to find solutions to combinatorial optimization problems. Metaheuristics use the information gathered during the search to guide the search process and are therefore considered problem-independent. The cuckoo search algorithm is a novel heuristic optimization approach introduced by Yang and Deb in 2009 [15]. The algorithm simulates the parasitic breeding habits of cuckoo birds and is a stochastic algorithm with strong global search ability. The cuckoo search algorithm has been efficiently employed in many fields, such as intelligent optimization and computation. Cuckoo search is superior to other algorithms on continuous optimization problems, including the spring design and welded beam problems in engineering design applications [20]. The algorithm is especially suitable for large-scale problems [21]. Valian et al. have applied it to training neural networks [22] and spiking neural models [23]. Experiments have shown that CS has better search capability than algorithms such as the particle swarm optimization algorithm, genetic algorithm, and artificial bee colony algorithm [21, 24, 25]. Therefore, CS is a metaheuristic algorithm that can be used for combinatorial optimization problems to obtain higher performance.

The CS can only solve optimization problems in a continuous solution space. To solve combinatorial optimization problems in a discrete solution space, Gherboudj et al. [26] proposed a binary version of the cuckoo search algorithm, namely, the BCS algorithm. Pereira and Rodrigues [27] applied the BCS algorithm to feature selection. Bhattacharjee and Sarmah [28] improved BCS by using a balanced combination of local random walks and global exploration random walks, so that the BCS algorithm better balances local and global search. Sudha and Selvarajan [29] presented a feature selection approach based on an enhanced cuckoo algorithm and applied it to breast X-ray images, where it can supply valuable information for clinicopathologists. Aziz and Hassanien [30] proposed a new improved cuckoo algorithm combined with the theory of rough sets and applied it to feature selection.

The cuckoo search algorithm uses Lévy flight random walks to explore the search space during each iteration. Because Lévy flights follow straight paths with sharp 90-degree turns, the cuckoo search cannot effectively search around the cuckoo's nest and therefore suffers from low optimization accuracy [31]. To improve the cuckoo search algorithm, this paper proposes an Elite Hybrid Binary Cuckoo Search algorithm, and the novelty of the paper is two-fold: (1) EHBCS adopts feature weighting and an elite strategy in the binary cuckoo search algorithm. Feature weighting based on the Relief algorithm estimates each feature's weight and importance according to its ability to distinguish instances of different classes. The elite strategy and the genetic algorithm's selection and crossover operators are embedded into the cuckoo algorithm so that well-positioned nests can be inherited by the next generation. (2) EHBCS is applied to a set of binary label datasets, including low-dimensional and high-dimensional samples, such that only the best features are retained in the subset. Experimental results demonstrate that EHBCS achieves better classification performance, minimizing the number of selected features while maximizing the SVM classification accuracy, compared with the binary genetic algorithm and binary particle swarm optimization.

The main contributions of this paper are summarized as follows: (1) To the best of our knowledge, this is the first work to combine feature weighting and an elite strategy with the BCS algorithm. (2) It specifically addresses the low optimization accuracy of the BCS algorithm. (3) It may provide useful insights for high-dimensional data research such as text processing, medical research, and gene analysis.

The structure of this paper is as follows: Section 2 provides details of the classical Cuckoo Search and Binary Cuckoo Search algorithms; Section 3 presents the Elite Hybrid Binary Cuckoo Search (EHBCS) algorithm; Section 4 discusses the experimental methodology, in particular the datasets and evaluation measures; Section 5 reports numerical experiments carried out to evaluate the prediction performance of our method, demonstrating that the proposed method is efficient for high-dimensional datasets; finally, the conclusions of our work are given in Section 6.

2. Cuckoo Search Algorithm

2.1. Cuckoo Search (CS) Algorithm

The parasitic behavior of cuckoos is extremely intriguing. These birds can lay their eggs in host nests and mimic external characteristics of the host eggs, such as color and spots. If this strategy is unsuccessful, the host can throw the cuckoo's eggs away or simply abandon its nest, making a new one in another place. Based on this behavior, Yang and Deb [15] developed a novel evolutionary optimization algorithm named cuckoo search (CS), which they summarized using three rules: (1) Each cuckoo chooses a nest at random in which to lay its eggs. (2) The number of available host nests is fixed, and nests with high-quality eggs will be passed on to the next generations. (3) If a host bird discovers the cuckoo egg, it can throw the egg away or abandon the nest and build a completely new one.

For optimization problems, each nest represents a possible solution to the problem, and a nest can contain one or more eggs depending on the size of the problem. Firstly, the algorithm randomly initializes each nest, and then an iterative process is carried out. During each iteration, each nest is updated by a Lévy flight random walk, as shown in Equations (1) and (2):

$x_i^{(t+1)} = x_i^{(t)} + \alpha \oplus \text{Lévy}(\lambda).$    (1)

The updating formula of each dimension is expressed as

$x_{ij}^{(t+1)} = x_{ij}^{(t)} + \alpha \oplus \text{Lévy}_j(\lambda),$    (2)

where $x_i^{(t)}$ denotes the $i$th nest and $x_{ij}^{(t)}$ stands for the $j$th egg at nest $i$ in the $t$th generation, $\alpha > 0$ is the step size, and the product $\oplus$ means entrywise multiplication. In most cases, we can use $\alpha = 1$. The Lévy flight $\text{Lévy}(\lambda)$ employs a random step length, and $\text{Lévy}_j(\lambda)$ is its $j$th component.

In the 1930s, Lévy proposed the Lévy distribution, holding that the relationship between the continuous jump path of a Lévy flight and time follows this distribution. Later, many scholars studied the Lévy distribution and used it to explain random phenomena in nature, such as Brownian motion and random walks. By simplification and Fourier transform, Yang [15] obtained the probability density function of the Lévy distribution in power-law form:

$\text{Lévy}(\lambda) \sim u = t^{-\lambda}, \quad 1 < \lambda \leq 3,$    (3)

where $\lambda$ is the power coefficient. Equation (3) is a probability distribution with a heavy tail. Although it essentially describes the random walk process of cuckoo birds, it is not expressed in a concise, easily programmable mathematical form for implementing the CS algorithm. Yang therefore adopted the Mantegna algorithm to simulate the Lévy jump path:

$s = \dfrac{u}{|v|^{1/\beta}},$    (4)

where $s$ is the Lévy flight step $\text{Lévy}(\lambda)$, and the parameters are related to Equation (3) by $\lambda = 1 + \beta$ with $0 < \beta \leq 2$; the parameter $\beta$ is typically set to $1.5$. Here $u$ and $v$ are random numbers drawn from the normal distributions given by Equations (5) and (6):

$u \sim N(0, \sigma_u^2), \quad v \sim N(0, \sigma_v^2),$    (5)

$\sigma_u = \left\{ \dfrac{\Gamma(1+\beta)\sin(\pi\beta/2)}{\Gamma[(1+\beta)/2]\,\beta\,2^{(\beta-1)/2}} \right\}^{1/\beta}, \quad \sigma_v = 1.$    (6)

Let $\text{step} = \alpha \oplus s$; then step is the path that a cuckoo bird travels each time in the solution space when it randomly searches for the new nest location from the old nest location according to Equation (1). In the final step of each iteration, the worst-quality nests are abandoned and replaced with probability $p_a \in [0,1]$. Algorithm 1 shows the pseudo-code for the classical version of CS.

Objective function f(x), x = (x_1, …, x_d)
Generate initial population of n host nests x_i (i = 1, 2, …, n)
while (t < MaxGeneration) or (stop criterion) do
 Get a cuckoo (say, i) randomly by Lévy flights and evaluate its quality/fitness F_i
 Choose a nest among n (say, j) randomly;
if (F_i > F_j) then
  replace j by the new solution;
end
 A fraction (p_a) of worse nests are abandoned and new ones are built
 Keep the best solutions (or nests with quality solutions)
 Rank the solutions and find the current best
end
Postprocess results and visualization
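
For illustration, the Lévy flight update of Equations (1) and (4)–(6) can be sketched in Python as follows; the names levy_step and cs_update and the default beta = 1.5 are illustrative choices, not taken from the paper.

import math
import numpy as np

def levy_step(dim, beta=1.5, rng=None):
    # Mantegna algorithm: s = u / |v|^(1/beta), Equations (4)-(6)
    rng = np.random.default_rng() if rng is None else rng
    sigma_u = (math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
               / (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.normal(0.0, sigma_u, size=dim)   # u ~ N(0, sigma_u^2)
    v = rng.normal(0.0, 1.0, size=dim)       # v ~ N(0, 1)
    return u / np.abs(v) ** (1 / beta)

def cs_update(nests, alpha=1.0, beta=1.5, rng=None):
    # Move every nest by a Lévy flight random walk, Equation (1)
    rng = np.random.default_rng() if rng is None else rng
    steps = np.array([levy_step(nests.shape[1], beta, rng) for _ in range(len(nests))])
    return nests + alpha * steps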
2.2. Binary Cuckoo Search (BCS) Algorithm

In traditional CS, the position of the solution is updated in a continuous search space. Unlike CS, the BCS search space for feature selection is modeled as a binary d-bit string, where d is the number of features. BCS represents each nest as a binary vector, where each 1 corresponds to a selected feature and each 0 to a discarded one. This means each nest represents a possible solution (a feature subset), and each bit of a nest represents a feature.

To extend the cuckoo algorithm to discrete binary spaces, the original BCS algorithm introduces a sigmoid mapping function [25]:

$S\big(x_{ij}^{(t)}\big) = \dfrac{1}{1 + e^{-x_{ij}^{(t)}}},$

$x_{ij}^{(t+1)} = \begin{cases} 1, & \text{if } \sigma < S\big(x_{ij}^{(t)}\big), \\ 0, & \text{otherwise}, \end{cases}$

in which $\sigma \sim U(0,1)$ and $x_{ij}^{(t+1)}$ denotes the new egg's value at iteration $t+1$.
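
A minimal sketch of this binarization step in Python (names are illustrative, not from the paper):

import numpy as np

def binarize(nest, rng=None):
    # Sigmoid mapping: each continuous value becomes the probability of selecting the feature
    rng = np.random.default_rng() if rng is None else rng
    s = 1.0 / (1.0 + np.exp(-nest))
    return (rng.random(nest.shape) < s).astype(int)   # 1 = feature kept, 0 = feature dropped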

3. Elite Hybrid Binary Cuckoo Search (EHBCS) Algorithm

3.1. Feature Weighting Based on Relief Algorithm

The core idea of feature weighting based on Relief is to estimate each feature's weight and importance according to its ability to distinguish instances of different classes [32]. Given a two-class dataset $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ containing $n$ cases, where $y_i \in \{0, 1\}$ is the class label and each case $x_i$ in $D$ is a real-valued vector of dimension $d$, Relief performs the following iterative learning: randomly select a case $x$, find the nearest case $H(x)$ of the same class and the nearest case $M(x)$ of the different class, and then update the weights using the following rule:

$w_j = w_j - \dfrac{\text{diff}(j, x, H(x))}{T} + \dfrac{\text{diff}(j, x, M(x))}{T},$

where $w_j$ represents the weight of the $j$th feature and $T$ represents the maximum number of iterations. $\text{diff}(j, \cdot, \cdot)$ calculates the difference between the $j$th dimensional feature values of two instances, that is, the $j$th component of the absolute value of the feature difference vector.

A variant that considers $k$ neighbors has been developed from the nearest-neighbor Relief, whose weight update formula is

$w_j = w_j - \sum_{h \in H_k(x)} \dfrac{\text{diff}(j, x, h)}{kT} + \sum_{m \in M_k(x)} \dfrac{\text{diff}(j, x, m)}{kT},$

where $H_k(x)$ and $M_k(x)$ are the sets of the $k$ nearest neighbors of $x$ in $D$ of the same class and of the different class, respectively, measured by Euclidean distance. The process is shown in Algorithm 2.

Input: binary label dataset D with n cases and d dimensions, Maxiter T
Output: weight vector w
for j = 1 to d do
 w_j = 0;
end
while t ≤ T do
 Randomly select a case x from the dataset and calculate the distances to its k nearest cases of the same class H_k(x) and its k nearest cases of the different class M_k(x);
for j = 1 to d do
   Δw_j generated by formula (11);
  w_j = w_j + Δw_j;
end
end
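
A minimal Python sketch of this weighting procedure, assuming min-max normalized features so that diff reduces to an absolute difference (function and parameter names are illustrative):

import numpy as np

def relief_weights(X, y, n_iter=100, k=5, rng=None):
    # k-nearest-neighbor Relief: a larger weight means a more discriminative feature
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        i = rng.integers(n)
        dist = np.linalg.norm(X - X[i], axis=1)
        dist[i] = np.inf                             # never pick the case itself
        same = np.where(y == y[i])[0]
        other = np.where(y != y[i])[0]
        hits = same[np.argsort(dist[same])[:k]]      # k nearest hits
        misses = other[np.argsort(dist[other])[:k]]  # k nearest misses
        w -= np.abs(X[hits] - X[i]).mean(axis=0) / n_iter
        w += np.abs(X[misses] - X[i]).mean(axis=0) / n_iter
    return w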
3.2. Selection and Crossover Operator

The selection operator passes individuals with high fitness in the current population to the next generation according to a selection probability. Generally, individuals with higher fitness have more chances to be inherited by the next generation. This paper uses the roulette-wheel model to select individuals. The calculation formulas are as follows:

$p_i = \dfrac{f_i}{\sum_{j=1}^{N} f_j}, \qquad q_i = \sum_{j=1}^{i} p_j,$

where $p_i$ is the selection probability, $q_i$ is the cumulative probability, $f_i$ is the fitness function value of individual $i$, and $N$ is the size of the group. The selection operator process is given in Algorithm 3.

Crossover recombines a selected pair of individuals according to a crossover probability, for example by single-point or multipoint crossover. In this paper, single-point crossover is adopted: a random number within the range of the individual's coding length is generated as the crossover point, and the codes of the two individuals are then exchanged from this point to the end, completing the crossover.
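
The two operators can be sketched in Python as follows (a simplified illustration assuming non-negative fitness values to be maximized; names are not from the paper):

import numpy as np

def roulette_select(population, fitness, rng=None):
    # Pick one individual with probability proportional to its fitness
    rng = np.random.default_rng() if rng is None else rng
    p = fitness / fitness.sum()          # selection probabilities p_i
    q = np.cumsum(p)                     # cumulative probabilities q_i
    return population[np.searchsorted(q, rng.random())]

def one_point_crossover(parent_a, parent_b, pc=0.8, rng=None):
    # Swap the tails of two binary strings after a random cut point
    rng = np.random.default_rng() if rng is None else rng
    a, b = parent_a.copy(), parent_b.copy()
    if rng.random() < pc:
        cut = rng.integers(1, len(a))    # crossover point within the coding length
        a[cut:], b[cut:] = parent_b[cut:].copy(), parent_a[cut:].copy()
    return a, b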

3.3. Weight-Based Elite Hybrid Binary Cuckoo Search (EHBCS) Algorithm

In the CS algorithm, the Lévy flight explores the search space along straight flight paths with sudden 90-degree turns; Figure 1 simulates a Lévy flight path. In addition, the CS algorithm relies heavily on random-walk search and can easily move from one area to another without carefully exploring around each nest. Therefore, the CS algorithm has weak local search ability and low optimization accuracy [31]. To overcome this weakness of CS, an elite strategy and genetic algorithm operators, namely the selection and crossover operators, are embedded into the cuckoo algorithm so that well-positioned nests can be inherited by the next generation. The elite strategy preserves the nests in good locations so as not to lose the optimal nest during the algorithm's Lévy flight iterations. According to certain rules, the selection operator passes individuals with high fitness in the current population to the next generation; generally, individuals with higher fitness have more chances to be inherited. The crossover operator usually takes two individuals as candidate solutions with a certain probability and generates neighborhood solutions by exchanging parts of the two individuals' chromosomes.

The CS algorithm is suited to continuous-domain problems, whereas feature selection is a binary discrete problem. Considering these facts, this paper proposes an Elite Hybrid Binary Cuckoo Search (EHBCS) algorithm. The EHBCS algorithm first weights the features according to the Relief algorithm described in Section 3.1, so that features with larger weights have a greater chance of being selected. Then, in each iteration of the EHBCS algorithm, the optimal nest does not undergo Lévy flight or crossover, to avoid damaging the optimal nest position. The nests generated by Lévy flight are processed by the selection and crossover operators.

Since the existing BCS algorithm does not consider feature importance in its mapping function, the coefficient in the mapping function is changed to the feature weight in this paper, so that features with significant feature weight have a greater chance of being selected and the improved algorithm can finish the iterative process faster. The BCS mapping function is modified as given in formulas (14)–(17).

The function $S(\cdot)$ does not represent the probability of a bit changing; rather, it represents the probability of a bit being set to 1. The corresponding function graph is shown in Figure 2. It can be seen from the figure that, for the same abscissa, the greater the weight parameter, the greater the corresponding function value; that is, the greater the feature weight, the greater the probability of the feature being selected.

It should be emphasized that the weights calculated by the Relief algorithm may be negative. A negative weight indicates that the distance to same-class neighbor samples is larger than the distance to different-class neighbor samples; such a feature is therefore considered unfavorable for classification, and the probability of selecting it during feature selection is correspondingly low.
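
One plausible reading of the weighted mapping is sketched below for illustration: the Relief weight scales the sigmoid so that, at the same abscissa, a larger weight gives a larger selection probability, while a negative weight pushes the probability below one half. This is an assumption made for illustration, not the authors' exact formulas (14)–(17).

import numpy as np

def weighted_binarize(nest, weights, rng=None):
    # Illustrative weighted sigmoid: the Relief weight replaces the sigmoid coefficient,
    # so heavily weighted features are selected with higher probability and
    # negatively weighted features with probability below 0.5
    rng = np.random.default_rng() if rng is None else rng
    s = 1.0 / (1.0 + np.exp(-weights * nest))
    return (rng.random(nest.shape) < s).astype(int)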

Because the purpose of both the nest-discovery operation and the crossover operation is to diversify the population, this paper adopts the crossover operation instead of the discovery operation. In the later iterations of the algorithm, the elite strategy proposed in this paper ensures convergence. The elite selection and crossover operators, as well as the pseudo-code of the proposed algorithm, are given in Algorithm 3 and Algorithm 4.

Input: population with n nests, number of dimensions (features) d, crossover rate pc, fitness function f
Output: New population after elite selection and crossover
for i = 1 to n do
 p(i) = f_i / Σ_{j=1..n} f_j, q(i) = Σ_{j=1..i} p(j);
for k = 1 to n do
 Generate a random number r from [0,1];
if (r ≤ q(1)) then
 Select the nest 1;
else
if q(k-1) < r ≤ q(k) then
 Select the nest k;
   end
  end
end
end
Train the classifier to evaluate accuracy of the selected nests;
Calculate the fitness function value and store it in f;
Bestfit = max(f);
Bestnest = nest with fitness Bestfit;
Keep Bestnest unchanged (elite strategy);
Two nests in the population are paired at random for each pair except Bestnest, such as nest a and nest b
if (rand < pc) then
Generate a random integer r in (1,d) with one-point crossover between the two individuals a and b;
Exchange the codes of a and b from position r to the end;
end
The crossed nests and Bestnest form a new population as output
Input: labelled dataset D, Maxiter T, CS parameter values, number of nests n, number of dimensions (features) d
Output: Bestnest, Bestfit
for i = 1 to n do
 nest_i = randomly generated binary 0-1 string;
Train the classifier to evaluate accuracy of nest_i;
Calculate the fitness function value and store it in fit_i;
end
Bestfit = max(fit);
Bestnest = nest with fitness Bestfit;
t = 1;
while t ≤ T or (stop criterion) do
for i = 1 to n do
  for j = 1 to d do
newnest_ij generated by formula (14)–(17) and store it in newnest_i;
end
end
A new population with n members is formed from newnest and Bestnest;
Train the classifier to evaluate accuracy of each newnest_i;
Calculate the fitness function value and store it in newfit_i;
for i = 1 to n do
if (newfit_i > fit_i) then
nest_i = newnest_i;
fit_i = newfit_i;
end
end
Generate new population after elite selection and crossover (Algorithm 3);
Train the classifier to evaluate accuracy of nest_i;
Calculate the fitness function value and store it in fit_i;
Bestfit = max(fit);
Bestnest = nest with fitness Bestfit;
t = t + 1;
end

4. Experimental Methodology

4.1. Datasets

Eight datasets were extracted from the UCI Machine Learning Repository [33–35]. In order to make a more comprehensive comparison between the proposed algorithm and other algorithms, four low-dimensional feature datasets and four high-dimensional feature datasets were selected. Each dataset has two classes, and Table 1 provides the datasets' names, the total number of features, the total number of cases, and the classification accuracy before feature selection.

4.2. Performance Evaluation Measures

Generalization ability is the ability of a model to predict new data accurately after training on the training datasets. Cross-validation is a method to evaluate model generalization ability, which is widely used in data mining and machine learning [36]. In cross-validation, the dataset is usually divided into two parts: the training set, which is used to build a prediction model, and the test set, which is used to test the model's generalization ability. k-fold cross-validation was performed, with the value of k set according to whether a dataset had fewer or more than 100 cases. The evaluation indicators used include Accuracy, Sensitivity, Specificity, Precision, and F-measure [37]:

$\text{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}, \quad \text{SE} = \dfrac{TP}{TP + FN}, \quad \text{SP} = \dfrac{TN}{TN + FP},$

$\text{Pre} = \dfrac{TP}{TP + FP}, \quad \text{F1} = \dfrac{2 \times \text{Pre} \times \text{SE}}{\text{Pre} + \text{SE}},$

where TP is the total number of positive cases correctly identified as positive, TN is the total number of negative cases correctly identified as negative, FP is the total number of negative cases wrongly identified as positive, and FN is the total number of positive cases wrongly identified as negative.

For the overall classification performance of each algorithm, we calculate the average value over all test folds as follows:

$\text{Avg} = \dfrac{1}{K} \sum_{i=1}^{K} \text{value}_i,$

where $K$ is the total number of folds.
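
For illustration, these measures can be computed from a confusion matrix and averaged over the cross-validation folds as in the following sketch (scikit-learn is assumed for the fold split and the SVM classifier; names and defaults are illustrative):

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def evaluate(X, y, n_splits=10, kernel="linear"):
    # Returns fold-averaged (Accuracy, SE, SP, Pre, F1) for a binary 0/1 label vector y
    scores = []
    for train, test in StratifiedKFold(n_splits=n_splits, shuffle=True).split(X, y):
        pred = SVC(kernel=kernel).fit(X[train], y[train]).predict(X[test])
        tp = np.sum((pred == 1) & (y[test] == 1))
        tn = np.sum((pred == 0) & (y[test] == 0))
        fp = np.sum((pred == 1) & (y[test] == 0))
        fn = np.sum((pred == 0) & (y[test] == 1))
        acc = (tp + tn) / len(test)
        se = tp / (tp + fn) if tp + fn else 0.0
        sp = tn / (tn + fp) if tn + fp else 0.0
        pre = tp / (tp + fp) if tp + fp else 0.0
        f1 = 2 * pre * se / (pre + se) if pre + se else 0.0
        scores.append((acc, se, sp, pre, f1))
    return np.mean(scores, axis=0)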

4.3. Evaluating Classification Performance

The support vector machine (SVM) classifier was adopted to evaluate the classification accuracy of feature subsets. SVM is a supervised machine learning algorithm introduced by Boser et al. [38], in which the data are mapped to points in an N-dimensional feature space (N being the number of features). The final output of SVM is an optimal hyperplane that classifies new cases.

SVM performance depends strongly on the kernel function, so experiments with different kernel functions are essential. A kernel function is a similarity function that determines the similarity between any two inputs. It is not difficult to obtain a kernel function: any function that satisfies Mercer's theorem can be used as one. There are various types of kernel functions, such as the linear kernel, polynomial kernel, radial basis function kernel, sigmoid kernel, and composite kernels. The appropriate kernel function depends on the dataset and the problem and is therefore often selected experimentally. Based on experiments, suitable kernel functions were selected for each dataset; they are presented in Table 2.
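
As a sketch of this experimental kernel selection (scikit-learn is assumed; the candidate list and fold count are illustrative), one can cross-validate each kernel and keep the best-scoring one:

from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def pick_kernel(X, y, kernels=("linear", "poly", "rbf", "sigmoid"), cv=10):
    # Return the kernel with the highest mean cross-validated accuracy
    scores = {k: cross_val_score(SVC(kernel=k), X, y, cv=cv).mean() for k in kernels}
    return max(scores, key=scores.get), scores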

4.4. Fitness Function

The main objective of the feature selection task is to find a subset of features from the dataset so that the learning algorithm can use these selected features to achieve as high accuracy as possible.

In classification problems, two feature subsets of different sizes may yield the same classification accuracy on the same dataset. Therefore, at equal classification accuracy, if the metaheuristic algorithm finds the subset with more features earlier, the subset with fewer features will be ignored. To overcome this, this paper proposes a new evaluation method as the fitness function, which considers the classification accuracy and takes the rate of feature reduction as an adjusting term.

Let $D$ be the total number of features contained in the dataset, $N$ be the number of features selected by the metaheuristic optimization algorithm, $\alpha$ be the weight of the rate of feature reduction, and $1 - \alpha$ be the weight of the average accuracy. The value of the fitness function can then be calculated as shown in Equation (28):

$\text{fitness} = (1 - \alpha) \cdot \text{Avgacc} + \alpha \cdot \dfrac{D - N}{D}.$    (28)

We set $\alpha = 0.2$.
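
Assuming the weighted-sum form described above, the fitness can be computed directly (names are illustrative):

def fitness(avg_accuracy, n_selected, n_total, alpha=0.2):
    # Weighted sum of the average accuracy and the feature-reduction rate
    reduction_rate = (n_total - n_selected) / n_total
    return (1 - alpha) * avg_accuracy + alpha * reduction_rate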

4.5. Parameter Setting

The performance of the proposed EHBCS is compared against the Binary Genetic Algorithm (BGA) and Binary Particle Swarm Optimization (BPSO) algorithms. Table 3 lists the parameter values for each algorithm. The population size of all optimization algorithms is set to 30, and each algorithm was run 5 times to perform the feature selection task. All runs were executed in Matlab 2017 on a Windows 10 operating system on a Huawei MagicBook with an Intel(R) Core(TM) i5-8250U at 1.6 GHz and 8 GB of RAM.

4.6. Analysis of Computational Complexity

The EHBCS algorithm uses the Relief algorithm, the binary conversion of Lévy flights, and the selection and crossover process. For the Relief algorithm, assuming that the number of iterations is T, the number of cases is n, and the individual dimension is d, the complexity is O(Tnd). For Lévy flight and binary conversion, assuming that the number of individuals is N, the individual dimension is d, and the number of iterations is G, the computational complexity is O(GNd). For selection and crossover, assuming the number of individuals is N, the computational complexity per iteration is O(Nd), that is, O(GNd) in total. Therefore, the overall computational complexity of the EHBCS algorithm is O(Tnd + GNd).

5. Experimental Results

Figures 3 and 4 provide the performance of all optimization algorithms for feature selection using the medical datasets described in Section 4.1. They contain the following information:

Accuracy: classification accuracy for each dataset

All: classification accuracy before feature selection for each dataset

SR: size reduction percentage is used to evaluate the percentage of removed features compared to all available features

Tables 4 and 5 provide the performance of all optimization algorithms for feature selection using binary label datasets described in Section 4.1. Each table column contains the following information:

Fitness: the fitness function used, either the accuracy as defined in Section 4.2 (Function (23)) or the proposed fitness (Function (28)) as defined in Section 4.4

Algorithm: it provides the abbreviations of the algorithms, Elite Hybrid Binary Cuckoo Search (EHBCS), Binary Genetic Algorithm (BGA), and Binary Particle Swarm Optimization (BPSO)

Avgacc, Max, Min: average accuracy, maximum accuracy, minimum accuracy of an algorithm during the 5 runs

Std: standard deviation of classification accuracy

AvgN: average number of features returned by the algorithm during the 5 runs

SE, SP, Pre, F1: average sensitivity, specificity, precision, and F-measure of an algorithm during the 5 runs

Dataset: the dataset used for experimentation as described in Table 1

Avg: average of all corresponding data obtained by the three algorithms

The experimental results show that the average feature subsets are smaller for all datasets, and the average classification accuracy is improved to different degrees. Compared with the original datasets, the average number of features in the subsets selected by the optimization algorithms was reduced by about 18.395%-89.667%, and the average classification accuracy was improved by about 3.3%-34.6%. For the Breast Cancer Wisconsin (diagnostic) dataset, the maximum average classification accuracy improvement of 34.6% was achieved. All these results imply that feature selection methods based on metaheuristic optimization algorithms can effectively eliminate redundant features and significantly improve the classification accuracy, especially for some datasets.

For low-dimensional datasets, such as Cervical Cancer Behavior Risk, Breast Cancer Wisconsin (diagnostic), Breast Cancer Wisconsin (prognosis), and Sonar, the EHBCS algorithm can effectively reduce the features to obtain a smaller target feature subset. It achieves the minimum standard deviation among the three algorithms, which shows that the EHBCS algorithm is the most stable of the three, but it ranks second among the three optimization algorithms in terms of classification accuracy, SE, SP, Pre, and F1. Compared with the data corresponding to Avg, the EHBCS algorithm has the minimum standard deviation and higher classification accuracy, SE, SP, Pre, and F1 overall. Compared with classification on the original datasets, the number of features in the subsets selected by the EHBCS algorithm is reduced by 58.182%-80%, and the classification accuracy is improved by 5%-33.9%. The results show that the EHBCS algorithm can efficiently diminish the number of features while maintaining accuracy, but it does not perform best on low-dimensional datasets.

For high-dimensional datasets, such as Colon Tumor, Medulloblastomas, Central Nervous System, and Relation Leukemia, the average classification accuracy, standard deviation, SE, SP, Pre, and F1 obtained by the EHBCS algorithm were superior to those of BGA and BPSO on the whole. Compared with the data corresponding to Avg, the average classification accuracy of the EHBCS algorithm is improved by 1%-10.6%, and EHBCS obtains a lower standard deviation. However, it should be noted that the standard deviation of the EHBCS algorithm is greater than the corresponding Avg value when adopting the accuracy fitness (Function (23)) for the Medulloblastomas and Central Nervous System datasets. Apart from this, SE, SP, Pre, and F1 are optimal overall. Compared with classification on the original datasets, the number of features in the subsets selected by the EHBCS algorithm is reduced by 43.772%-53.498%, and the classification accuracy is improved by 4.5%-22.8%. The results show that the feature selection method based on EHBCS achieves higher classification accuracy, SE, SP, Pre, and F1 and a smaller standard deviation. The EHBCS algorithm is therefore better suited to feature selection on high-dimensional datasets.

It should be emphasized that the purpose of feature selection is to remove irrelevant or weakly correlated features as far as possible while preserving classification accuracy. However, the number of selected features cannot be reduced indefinitely: too small a feature subset may lose important features and thus harm the classification accuracy on the datasets. Therefore, it is necessary to balance the classification accuracy against the size of the feature subset. In practical applications, the evaluation function model should be set scientifically and reasonably to ensure the classification performance of the feature subsets.

6. Conclusion

This paper proposes an Elite Hybrid Binary Cuckoo Search algorithm that adopts feature weighting and an elite strategy. The proposed EHBCS algorithm aims to optimize the feature selection task on binary label datasets. The experimental results show that EHBCS achieves better classification performance. Besides, all statistical metrics (standard deviation (Std), sensitivity (SE), specificity (SP), precision (Pre), and F-measure (F1)) reveal that EHBCS is markedly superior to BGA and BPSO. However, the algorithm still has shortcomings, such as increased computational complexity.

Future work requires further modification of the proposed algorithm to make it suitable for feature selection of multiclass datasets and to evaluate the results using different datasets and classification models.

Data Availability

The data are available at the following dataset sites: http://archive.ics.uci.edu/ml, http://portals.broadinstitute.org/cgi-bin/cancer/datasets.cgi, and http://csse.szu.edu.cn/staff/zhuzx/Datasets.html.

Conflicts of Interest

The authors declare that they have no conflicts of interest.