Abstract

A computer-aided diagnosis (CAD) system that employs a super learner to diagnose the presence or absence of a disease has been developed. Each clinical dataset is preprocessed and split into a training set (60%) and a testing set (40%). A wrapper approach that uses three bioinspired algorithms, namely, cat swarm optimization (CSO), krill herd (KH), and bacterial foraging optimization (BFO), with the classification accuracy of a support vector machine (SVM) as the fitness function has been used for feature selection. The selected features of each bioinspired algorithm are stored in three separate databases. The features selected by each bioinspired algorithm are used to train three back propagation neural networks (BPNN) independently using the conjugate gradient algorithm (CGA). Classifier testing is performed by using the testing set on each trained classifier, and the diagnostic results obtained are used to evaluate the performance of each classifier. The classification results obtained from the three classifiers for each instance of the testing set, together with the class label associated with that instance, form the candidate instances for training and testing the super learner. Of these candidate instances, 80% form the training set and 20% form the testing set of the super learner. Experimentation has been carried out using seven clinical datasets from the University of California Irvine (UCI) machine learning repository. The super learner has achieved a classification accuracy of 96.83% for the Wisconsin diagnostic breast cancer dataset (WDBC), 86.36% for the Statlog heart disease dataset (SHD), 94.74% for the hepatocellular carcinoma dataset (HCC), 90.48% for the hepatitis dataset (HD), 81.82% for the vertebral column dataset (VCD), 84% for the Cleveland heart disease dataset (CHD), and 70% for the Indian liver patient dataset (ILP).

1. Introduction

Data related to symptoms observed in a patient at a point in time are stored in electronic health records (EHRs). Interesting patterns can be extracted from the data stored in EHRs and represented as knowledge, and this knowledge can assist physicians in diagnosing the presence or absence of a disease. Data mining tasks, namely, association rule mining, classification, and clustering, are used to mine valuable patterns from the data stored in EHRs. Clinical decision support systems (CDSSs) that assist physicians in diagnosing the presence or absence of a disease can be developed from data stored in EHRs using bioinspired algorithms and data mining techniques. Although several algorithms have been proposed by researchers for association rule mining, classification, and clustering, no algorithm can be considered the “universal best.” Quality of data and data distribution are the two key factors that determine the effectiveness of a data mining task. The performance of a data mining task depends on how effectively data preprocessing has been done. Classification plays a major role in the development of CDSSs. Classification is a two-step process: first, building the classifier and second, model usage. Building the classifier is the process of training the classifier with a supervised learning algorithm. Model usage is the process of estimating the accuracy of the classifier using testing instances, commonly referred to as the testing set. Overfitting and underfitting are two major problems associated with building the classifier.

A clinical dataset used for classifier construction is split into a training set and a testing set. Researchers have proposed different methods to determine the training set and the testing set. One common method is to assign 80% of the dataset to the training set and 20% of the dataset to the testing set. For clinical decision-making, a balanced dataset is essential for building a prediction model. Clinical datasets are normally not balanced, and classification methods perform poorly on minority class samples when the dataset is severely imbalanced. For example, consider a dataset in which each instance is associated with one of two class labels. If 75% of the instances are associated with one class label and 25% of the instances are associated with the other, the class labels are not equally represented, and therefore, the dataset is imbalanced. In this context, the class holding 75% of the instances is the majority class, and the class holding 25% is the minority class; hence, constructing a classifier with class-imbalanced data will lead to bias in favor of the majority class. One method to handle class imbalance in a dataset is to generate additional instances from the minority class. The Synthetic Minority Oversampling Technique (SMOTE) [1] is one of the prevailing methods used to generate additional training and testing instances.
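The interpolation idea behind minority oversampling can be illustrated with a minimal sketch. It is not the reference implementation of SMOTE [1]: the function name `smote_like_oversample`, the neighbour count `k`, and the toy data are assumptions made only for illustration. Each synthetic instance is placed at a random point on the segment joining a minority instance and one of its nearest minority neighbours.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    instance and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest neighbours of `base` inside the minority class (itself excluded).
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy dataset with a 75% / 25% class split (9 majority, 3 minority instances).
majority = [[float(i), float(i % 3)] for i in range(9)]
minority = [[10.0, 10.0], [11.0, 10.5], [10.5, 11.5]]

new_points = smote_like_oversample(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_points  # now as large as the majority class
```

Because every synthetic point is a convex combination of two existing minority instances, it always lies within the region already spanned by the minority class.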

A training instance can be defined as a tuple of feature values, identified by an instance index and a feature index. The instance index ranges over the number of instances, and the feature index ranges over the number of features. Using irrelevant features to train a classifier will affect its performance. Selecting the optimal features from the dataset and then training the classifier will enhance the accuracy of the classifier. Feature selection methods can be supervised, unsupervised, or semisupervised, depending upon whether the training set is labeled or not. Commonly used supervised feature selection methods are filter and wrapper methods. The filter method considers the dependency of each feature on the class label and is independent of any classification algorithm. Measures, namely, information gain [2], gain ratio [3], Gini index [4], Laplacian score [5], and cosine similarity [6] can be used to rank the features. Other measures to rank the features can also be used in the filter method. The wrapper method considers the classification accuracy of a learning algorithm to select the relevant features. Researchers are using a confluence of disciplines to develop computer-aided diagnostic (CAD) systems to assist physicians.
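As an illustration of the filter method, the following sketch ranks discrete features by information gain. The feature names, labels, and helper functions are hypothetical and chosen only to keep the example self-contained; it is a sketch of the general measure, not of any procedure used in this work.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the expected entropy after splitting
    the instances on a discrete feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


labels = ["sick", "sick", "healthy", "healthy"]
features = {
    "marker_a": ["high", "high", "low", "low"],  # separates the classes perfectly
    "marker_b": ["yes", "no", "yes", "no"],      # carries no class information
}

# Filter-style ranking: the score depends only on the data, not on a classifier.
ranking = sorted(features, key=lambda f: information_gain(features[f], labels), reverse=True)
```

A wrapper method would instead score each candidate feature subset by training and testing a classifier on it.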

Knowledge mining using rough sets for feature selection and backpropagation neural network (BPNN) for classifying clinical datasets has been proposed in [7]. A CDSS to diagnose Urticaria using Bayes classification is proposed in [8]. CDSSs to diagnose lung disorders are proposed in [9-14]. A CDSS to diagnose the severity of gait disturbances using a Q-backpropagated time delay neural network on patients affected by Parkinson’s disease is proposed in [15]. A statistical tolerance rough set induced decision tree classifier to classify multivariate time series clinical data is proposed in [16]. A CDSS to diagnose gestational diabetes mellitus using fuzzy logic and a radial basis function neural network is proposed in [17]. Use of fuzzy sets and extreme learning machine to classify clinical datasets is proposed in [18]. Wind-driven swarm optimization, a metaheuristic method to classify clinical datasets, is proposed in [19]. A computer-aided diagnostic system that uses a neural network classifier trained using differential evolution, particle swarm optimization, and gradient descent backpropagation algorithms is proposed in [20]. A radial basis function neural network to classify clinical datasets using the k-means clustering algorithm and quantum-behaved particle swarm optimization is proposed in [21]. Classifying clinical unevenly spaced time series data by imputing missing values has been proposed in [22]. A framework to classify unevenly spaced time series clinical data using improved double exponential smoothing, rough sets, neural network, and fuzzy logic is proposed in [23].

An outline of nature-inspired algorithms for optimization is presented in [24]. The cooperative, intelligent behavior of insect or animal groups in nature, for example, colonies of ants, schools of fish, flocks of birds, swarms of bees, and termites, has attracted the attention of researchers. Entomologists have studied the collective actions of insects or animals to model biological swarms, and engineers have applied these models as a framework to solve complex real-world problems.

In this work, a CAD system that employs a super learner to diagnose the presence or absence of a disease has been proposed. The bioinspired algorithms used in this work are cat swarm optimization (CSO), krill herd (KH), and bacterial foraging optimization (BFO). The classifiers used in this work are support vector machine (SVM) and BPNN trained using the conjugate gradient algorithm.

The rest of the paper is organized as follows: the abbreviations used in the manuscript are presented in Section 2. An outline of the related work is presented in Section 3. An outline of the datasets used is presented in Section 4. The framework of the proposed classifier is presented in Section 5. The results and discussions are presented in Section 6. Finally, the conclusion and scope for future work are presented in Section 7.

2. Abbreviations Used

Table 1 presents the abbreviations used in the rest of the manuscript in alphabetical order.

3. Literature Survey

Leema et al. [25] in their work have examined the significance of fixing appropriate values for the parameters used to train artificial neural networks with the backpropagation algorithm. The parameters are initial weight selection, bias, activation function used, number of hidden layers, number of neurons per hidden layer, number of training epochs, minimum error, and momentum term. Twelve backpropagation learning algorithms have been used in this study. Experimentation has been carried out using three clinical datasets from the UCI ML repository, namely, PID, hepatitis, and WBC datasets.

Elgin et al. [26] in their work have proposed a clinical decision-making system to diagnose allergic rhinitis. A wrapper approach that uses GA with the accuracy of the ELM classifier as the fitness function has been used for feature selection. The selected features have been used to train an ELM classifier. An intradermal skin test dataset of 872 patients collected from Good Samaritan Lab Services and Allergy Testing Centre, Chennai, has been used in this work, and an accuracy of 97.7% has been achieved.

Sreejith et al. [27] in their work have proposed a framework for classifying clinical datasets which uses an embedded approach for feature selection and a DISON for classification. The feature selection is performed by computing the feature importance of every attribute using an extremely randomized tree classifier. Classification is performed using DISON, which is a feed forward neural network whose weights and bias are optimized in two stages: first, by using a strawberry optimization algorithm and then by using a gradient descent BP algorithm. Vertebral column, PID, CHD, and SHD datasets from the UCI ML repository have been used for experimentation. The framework has achieved an accuracy of 87.17% for vertebral column, 90.92% for PID, 93.67% for CHD, and 94.5% for SHD.

Sreejith et al. [28] in their work have proposed a framework for CDSS which addresses the data imbalance problems associated with clinical datasets. The datasets are rebalanced using SMOTE enhanced using Orchard’s algorithm. The feature selection is performed using a wrapper approach where CMVO is used to select the feature subsets, and an RF classifier is used to evaluate the goodness of the features. The arithmetic mean of MCC and F-score computed using the RF classifier is used as the fitness function. Finally, an RF classifier, comprising 100 decision trees which use information gain ratio as the split criterion, is used for classifying the clinical data. Three clinical datasets from the UCI ML repository, namely, ILP, TS, and PID datasets, have been used for experimentation. The proposed framework achieved 0.65 MCC, 0.84 F-score, and 82.46% accuracy for ILP; 0.74 MCC, 0.87 F-score, and 86.88% accuracy for TS; and 0.78 MCC, 0.89 F-score, and 89.04% accuracy for PID datasets.

Isaac et al. [29] in their work have proposed a CAD system to diagnose pulmonary emphysema from chest CT slices. The spatial intuitionistic fuzzy c-means clustering algorithm has been used to segment the lung parenchyma and extract the RoIs. From the RoIs, shape, texture, and run-length features have been extracted, and feature selection has been performed using a wrapper approach using four bioinspired algorithms with the classification accuracy of SVM as the fitness function. The bioinspired algorithms used are MFO, FFO, ABCO, and ACO. The tenfold crossvalidation technique has been used, and each feature set has been used to train an ELM classifier. Two independent datasets, one dataset consisting of CT slices collected from hospitals and the second dataset consisting of CT slices from a benchmark repository, have been used for classification. Maximum classification accuracies of 89.19% for MFO, 91.89% for FFO, 83.78% for ABCO, 86.49% for ACO, and 75.68% without feature selection have been achieved.

Elgin et al. [30] in their work have performed feature selection and instance selection using a wrapper approach that employs cooperative coevolution with the classification accuracy of the random forest classifier as the fitness function. The optimal feature set is used to train a random forest classifier. Seven datasets, namely, WDBC, HD, PID, CHD, SHD, VCD, and HCC from the UCI ML repository have been used for experimentation. Accuracies of 97.1%, 82.3%, 81.01%, 93.4%, 96.8%, 91.4%, and 72.2% have been achieved for the WDBC, HD, PID, CHD, SHD, VCD, and HCC datasets, respectively.

Anter et al. [31] in their work have developed CFCSA by integrating chaos theory and the FCM method to find the optimal feature subset. Ten clinical datasets from the UCI ML repository have been used for experimentation. The features of each clinical dataset have been normalized, and then random chaotic motion has been incorporated into CFCSA in the form of chaotic maps. The objective function of the FCM has been used as the fitness function, in which the crow with the best fitness has been considered the best solution. Comparison has been done with chaotic ant lion optimization, binary ant lion optimization, and the binary crow search algorithm, and it has been inferred that CFCSA outperforms these algorithms in all the datasets used for experimentation.

Elgin et al. [32] in their work have proposed a correlation-based ensemble feature selection using a wrapper approach that employs three bioinspired algorithms, namely, differential evolution, lion optimization, and glowworm swarm optimization, with the accuracy of the AdaboostSVM classifier as the fitness function. The tenfold crossvalidation technique has been used, and the optimal features selected have been used to train a gradient descent BP neural network with variable learning rates. Two clinical datasets from the UCI ML repository, namely, hepatitis and WDBC have been used for experimentation. Accuracies of 93.902% for the hepatitis dataset and 98.734% for the WDBC dataset have been achieved.

Sweetlin et al. [33] in their work have proposed a CAD system to diagnose pulmonary tuberculosis from chest CT slices. The region growing algorithm has been used for segmenting the lung fields followed by edge reconstruction. The manifestations of pulmonary tuberculosis, namely, cavities, consolidations, and nodules have been considered to be RoIs. From the extracted RoIs, texture features, run-length features, and shape features have been extracted, and feature selection has been performed using a wrapper approach that employs the BCS algorithm with the accuracy of a one-against-all multiclass SVM classifier as the fitness function. The cuckoo search algorithm has been implemented in two ways: first, by using an entropy measure and second, without using an entropy measure. Using the selected features, training is performed using a one-against-all multiclass SVM classifier. An accuracy of 85.54% for the BCS algorithm with entropy measure and an accuracy of 84.65% for the BCS algorithm without entropy measure have been achieved.

Sweetlin et al. [34] in their work have proposed a CAD system to diagnose pulmonary hamartoma nodules from chest CT slices. Otsu’s thresholding method has been used to segment lung parenchyma from the CT slices. Nodules are considered to be the RoIs and from the RoIs, texture features, shape features and run-length features have been extracted. Feature selection has been performed using filter evaluation measures, namely, CSM and RDM with the ACO algorithm. The features selected by ACO-CSM and ACO-RDM have been used to train three classifiers, namely, SVM, NB, and J48 decision tree classifiers. Maximum classification accuracy of 94.36% for SVM classifier trained with 38 features selected using ACO-RDM has been achieved.

Sweetlin et al. [35] in their work have proposed a CAD system to diagnose pulmonary bronchitis from CT slices of the lung. Optimal thresholding has been used to segment the left and right lung fields from the lung CT slices. The RoIs are identified, and from the RoIs, texture and shape features have been extracted. Feature selection has been performed using a hybrid ACO algorithm combined with tandem run recruitment based on cosine similarity, and the accuracy of the SVM classifier has been used as the fitness function. The selected features have been used to train an SVM classifier. Accuracies of 81.66% for ACO with the tandem run strategy, 78.10% for ACO without the tandem run strategy, and 75.14% without feature selection have been achieved.

Raj et al. [36] in their work have proposed DGA for feature selection to develop a CAD system to diagnose lung disorders from chest CT slices. The entire dataset has been split into two sets: one set containing 90% of the entire dataset and the other set containing 10% of the entire dataset. Out of the 90%, 50% has been used as the training set and the other 50% as the validation set for evaluating the objective function. The set containing 10% of the entire dataset has been used as the testing set. The objective function has been defined as the sum of the squared deviation of each data item in the training set of each class from each data item in the validation set of the corresponding class. GA has been used for feature selection by minimizing the proposed objective function, resulting in the proposed DGA. The GA has been iterated over several generations to obtain individuals that are best fit with respect to the objective function. Classification has been performed using a k-NN classifier to classify the RoIs into one of four classes, namely, bronchiectasis, tuberculosis, pneumonia, and normal. An average accuracy of 88.16% with feature selection and an average accuracy of 86.46% without feature selection have been achieved.

Zawbaa et al. [37] in their work have performed feature selection using a wrapper approach that uses the MFO algorithm with the accuracy of the k-NN classifier as the fitness function. Eighteen datasets from the UCI ML repository have been used for experimentation, among which four are clinical datasets. Comparison has been done with PSO and GA, and it has been inferred that MFO outperforms them on fourteen datasets, among which three are clinical datasets.

Shu-Chuan et al. [38] in their work have presented an algorithm called CSO by modeling the natural behavior of cats. The CSO algorithm considers two biological characteristics of cats, namely, seeking mode and tracing mode. Cats spend most of the time when they are awake resting. Nevertheless, while resting, their alertness is very high, and they are well aware of what is happening around them. Cats continuously observe their environment carefully and consciously, and when they perceive prey, they advance towards it rapidly. While resting, they change their position cautiously and slowly, and occasionally even stay in the original position. Seeking mode has been used to represent this resting behavior in CSO, and tracing mode has been used to represent the behavior of cats advancing towards prey. The performance of CSO has been evaluated by applying CSO, standard PSO, and PSO with weighting factor to six benchmark functions. The results obtained reveal that the proposed CSO performs better compared to PSO and PSO with weighting factor.

Gandomi et al. [39] in their work have proposed a swarm intelligence algorithm named the KH algorithm to solve optimization tasks; it is centered on the imitation of the herding behavior of krill swarms with respect to precise biological and environmental processes. The fitness function of each krill individual has been defined as the least distance of each individual krill from food and from the highest density of the herd. Three vital actions considered to define the time-dependent position of an individual krill are, one, movement induced by other krill individuals, two, foraging activity, and three, random diffusion. The KH algorithm has been tested using twenty benchmark functions and compared with eight algorithms. Experimentation results indicate that the KH algorithm can outperform these familiar algorithms.

Chen et al. [40] have proposed a cooperative bacterial foraging optimization algorithm (CBFO). Two cooperative methods are applied to the original BFO [41] to solve complex optimization problems, and a significant improvement has been achieved. The serial heterogeneous cooperation on the implicit space decomposition level and on the hybrid space decomposition level are the two methods used to improve the original BFO. The authors have compared the performance of two CBFO variants with the original BFO, PSO, and GA on four commonly used benchmark functions. The experimental results indicated that CBFO achieved better performance than the original BFO, PSO, and GA.

Chen et al. [42] have proposed an adaptive bacterial foraging optimization (ABFO) for optimizing functions. Adaptive foraging approaches are used to increase the performance of the original BFO. This is achieved by enabling the original BFO to adjust the run-length unit parameter dynamically while the algorithm is running. The experimental results are compared with the original BFO, PSO, and GA using four benchmark functions. The proposed ABFO shows better performance than the original BFO and is competitive with PSO and GA.

From the literature, it is evident that classifier training using relevant features enhances the accuracy of the classifier. It can also be inferred that wrapper-based feature selection that employs bioinspired algorithms performs better in numerous cases compared to traditional feature selection methods.

4. Outline of the Datasets Used

Seven clinical datasets from the UCI ML repository, namely, WDBC, SHD, HCC, HD, VCD, CHD, and ILP have been used for binary classification. An outline of each dataset used is presented in Table 2.

5. System Framework

The framework for feature selection and classification of clinical datasets using bioinspired algorithms and a super learner is presented in Figure 1. The major building blocks of the framework are data preprocessing, feature selection, classifier training, classifier testing, dataset construction for the super learner, and super learner training and testing. Each building block is outlined below.

5.1. Preprocessing

Each clinical dataset has been subjected to preprocessing prior to feature selection to enhance the quality of data. Mean imputation has been used to handle missing values, and SMOTE is used to handle the class imbalance problem in each dataset by generating additional instances from the minority class.
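Mean imputation can be sketched in a few lines. The sketch assumes that missing values are encoded as `None` and that every feature has at least one observed value; the function name `mean_impute` and the toy records are illustrative only.

```python
def mean_impute(rows):
    """Replace None entries in each column with the mean of that
    column's observed (non-missing) values."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if r[j] is None else r[j] for j in range(n_cols)] for r in rows]


# Two features, with one missing value in each.
records = [[1.0, 10.0], [None, 20.0], [3.0, None]]
imputed = mean_impute(records)
```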

Normalization has been used to scale the value of a feature so that the value falls in a specified range, and it is predominantly useful for constructing a classifier involving a neural network. Training a classifier using normalized data will speed up learning. In this work, the range is 0 to 1, and min-max normalization is used. When an attribute $A$ in a clinical dataset is subjected to min-max normalization, the minimum value ($\min_A$) and maximum value ($\max_A$) in the value set of $A$ are first identified, and normalization is performed using the formula presented in equation (1).

$$v' = \frac{v - \min_A}{\max_A - \min_A}\left(\mathrm{new\_max}_A - \mathrm{new\_min}_A\right) + \mathrm{new\_min}_A \qquad (1)$$

In the formula, $v'$ is the normalized value of the attribute $A$, where $v$ is drawn from the value set of $A$. Since min-max normalization is being used to normalize the values in the range 0 to 1, the value of $\mathrm{new\_max}_A$ is 1 and $\mathrm{new\_min}_A$ is 0.
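A minimal sketch of min-max normalization of a single attribute follows. The function name, the default target range, and the handling of a constant attribute (mapped to the lower bound) are assumptions made for illustration.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale the values of one attribute to [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant attribute: the formula would divide by zero, so map to new_min.
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


ages = [20.0, 35.0, 50.0]
scaled = min_max_normalize(ages)  # smallest value maps to 0, largest to 1
```

In practice, the minimum and maximum would typically be taken from the training set and reused on the testing set so that both are scaled consistently.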

The number of instances in each dataset used for constructing and testing the classifier prior to generating additional samples using SMOTE, the number of instances in each dataset after generating additional samples using SMOTE, the number of instances in the training set, and the number of instances in the testing set are presented in Table 3. After preprocessing, each dataset is split into a training set (60%) and a testing set (40%).
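The 60%/40% split can be sketched as below. Whether the split in this work is random, stratified, or sequential is not stated here, so the seeded random shuffle is an assumption, as are the function and variable names.

```python
import random


def split_dataset(instances, train_fraction=0.6, seed=42):
    """Shuffle a copy of the instances and cut it into training and testing parts."""
    shuffled = instances[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


dataset = list(range(10))  # stand-in for ten preprocessed instances
training_set, testing_set = split_dataset(dataset)
```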

5.2. Feature Selection

Feature selection is performed on each dataset used for experimentation to select the optimal features for training the classifier. Selecting the optimal features from each dataset will improve the classification accuracy. A wrapper approach that uses three bioinspired algorithms, namely, CSO, KH, and BFO, with the accuracy of the SVM classifier as the fitness function is used to perform feature selection. An outline of CSO, KH, and BFO used for feature selection is presented below.
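The wrapper idea, in which a candidate feature subset is scored by the accuracy of a classifier trained on that subset, can be sketched as follows. To keep the example dependency-free, a nearest-centroid classifier is used as a stand-in; it is not the SVM used in this work, and the function names and toy data are hypothetical.

```python
import math


def nearest_centroid_accuracy(train_X, train_y, test_X, test_y):
    """Stand-in classifier (NOT an SVM): predict the class whose centroid is closest."""
    centroids = {}
    for label in sorted(set(train_y)):
        rows = [x for x, y in zip(train_X, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    correct = 0
    for x, y in zip(test_X, test_y):
        predicted = min(centroids, key=lambda c: math.dist(x, centroids[c]))
        correct += predicted == y
    return correct / len(test_y)


def fitness(mask, train_X, train_y, test_X, test_y):
    """Wrapper fitness: accuracy obtained using only the features where mask == 1."""
    selected = [j for j, bit in enumerate(mask) if bit == 1]
    if not selected:
        return 0.0  # an empty subset cannot be evaluated
    project = lambda X: [[row[j] for j in selected] for row in X]
    return nearest_centroid_accuracy(project(train_X), train_y, project(test_X), test_y)


# Feature 0 is informative; feature 1 is misleading on the held-out instances.
train_X = [[0.0, 1.0], [0.2, 3.0], [1.0, 2.0], [0.8, 6.0]]
train_y = [0, 0, 1, 1]
test_X = [[0.1, 5.0], [0.9, 1.0]]
test_y = [0, 1]

score_first_only = fitness([1, 0], train_X, train_y, test_X, test_y)
score_all = fitness([1, 1], train_X, train_y, test_X, test_y)
```

Each bioinspired algorithm then searches the space of binary masks for the one with the highest fitness.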

5.2.1. Outline of the CSO Algorithm for Feature Selection

CSO is inspired by and modeled on two main postures of cats, namely, resting and tracing. Mimicking the resting behavior of a cat is termed seeking mode, and mimicking the tracing behavior of a cat is termed tracing mode. The seeking mode relates to a local search process, whereas the tracing mode relates to a global search process. Tracing mode corresponds to a cat’s movement while chasing prey, for example, chasing a rat. The vital parameters that play an important role in CSO are outlined in Table 4.

The steps to select the optimal feature subset using CSO are outlined below (Algorithm 1):

Input: training set
Process:
Step 1: initialize the population of cats (solutions) at random. Each solution is a binary vector whose length equals the number of features. If the corresponding feature is selected, it is represented as “1”; else, it is represented as “0.” Initialize the parameters, namely, SMP, SRD, CDC, SPC, MR, C, and R.
Step 2: calculate the fitness value of each cat (solution) using the SVM classifier, where the accuracy of the SVM classifier is considered as the fitness function. The solution that has the maximum fitness value obtained so far is considered as the best solution.
Step 3: assign the cats to perform seeking mode. Seeking mode refers to a cat at rest and its movement to the next position by looking around itself.
Step 3a: create (SMP) copies of the current cat. All the copies are considered to be candidate solutions.
Step 3a. i: if the value of SPC is true, one among the candidates retains its position, while the rest change their positions with respect to a randomly selected SRD.
Step 3a. ii: if the value of SPC is false, then all the candidates change their positions by a randomly selected CDC.
Step 3b: calculate the probability of each solution being selected using Equation (2) to find the best solution that has the maximum chance to survive. If all the solutions produce the same fitness value, then the probability value is considered as “1.”
$$P_i = \frac{\lvert FS_i - FS_b \rvert}{FS_{\max} - FS_{\min}} \qquad (2)$$
In the above formula, $P_i$ is the probability of the current cat $i$, $FS_i$ is its fitness value, $FS_{\max}$ is the maximum fitness value, and $FS_{\min}$ is the minimum fitness value. $FS_b = FS_{\min}$ is assigned if maximum fitness has to be calculated, and $FS_b = FS_{\max}$ is assigned if minimum fitness has to be calculated. In our work, the value of $FS_b$ is assigned to $FS_{\min}$.
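Step 4: perform tracing mode. In this mode, the cats update their position based on their velocity. Calculate the velocity and update the position of each cat using Equation (3) and Equation (4).
$$v_{k,d}(t+1) = v_{k,d}(t) + R \cdot C \cdot \left(x_{\mathrm{best},d}(t) - x_{k,d}(t)\right) \qquad (3)$$
$$x_{k,d}(t+1) = x_{k,d}(t) + v_{k,d}(t+1) \qquad (4)$$
In the above formulas, $x_{k,d}(t)$ and $v_{k,d}(t)$ are the position and velocity of the current cat $k$ at iteration $t$. The best solution from the cats in the population is denoted by $x_{\mathrm{best},d}$; $d$ denotes the dimension to be changed; $C$ is a constant, and $R$ is a random number between 0 and 1.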
Step 5: update the best solution that has the maximum fitness value. If the best solution from the previous iteration has a lower fitness value than the current best solution, then replace it with the current best solution; otherwise, retain the previous best solution.
Step 6: repeat step 2 to step 5 for a maximum number of iterations or until the convergence of solution is reached. The solution with the maximum fitness value obtained by the classifier is considered as the optimal feature subset.
Output: optimal feature subset.
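The two core computations of Algorithm 1 can be sketched as follows, assuming the standard CSO formulation of [38]: in seeking mode the selection probability is $|FS_i - FS_b| / (FS_{\max} - FS_{\min})$, and in tracing mode the velocity is moved toward the best position before being added to the current position. The thresholding used to turn a continuous position into a binary feature mask is an assumption of this sketch, as are all function names and the value of the constant `c`.

```python
import random


def seeking_probabilities(fitness_values, maximize=True):
    """Selection probability of each candidate copy in seeking mode.

    FS_b is the minimum fitness when maximizing (and the maximum when
    minimizing); every probability is 1 if all fitness values are equal.
    """
    fs_max, fs_min = max(fitness_values), min(fitness_values)
    if fs_max == fs_min:
        return [1.0] * len(fitness_values)
    fs_b = fs_min if maximize else fs_max
    return [abs(fs - fs_b) / (fs_max - fs_min) for fs in fitness_values]


def tracing_step(position, velocity, best_position, c=2.0, rng=random):
    """One tracing-mode update: pull the cat toward the best position found so far."""
    new_position, new_velocity = [], []
    for x, v, x_best in zip(position, velocity, best_position):
        r = rng.random()  # random number in [0, 1)
        v_next = v + r * c * (x_best - x)
        new_velocity.append(v_next)
        new_position.append(x + v_next)
    return new_position, new_velocity


def to_feature_mask(position, threshold=0.5):
    """Assumed binarization: a feature is selected when its coordinate exceeds the threshold."""
    return [1 if x > threshold else 0 for x in position]


# SVM accuracies of three candidate copies (illustrative numbers only).
probabilities = seeking_probabilities([0.70, 0.80, 0.90])

moved, speed = tracing_step([0.0, 0.0], [0.0, 0.0], [1.0, 1.0], rng=random.Random(1))
mask = to_feature_mask(moved)
```

With accuracy as the fitness function, the candidate with the highest accuracy receives probability 1 and the one with the lowest receives probability 0.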
5.2.2. Outline of the KH Algorithm for Feature Selection

The KH algorithm is centered on the imitation of the herding behavior of krill swarms with respect to precise biological and environmental processes. Krill density is reduced by predators, namely, seals, penguins, or seabirds. The herding of the krill individuals includes, one, increasing the krill density and two, reaching the food. The fitness function of each krill individual has been defined as the least distance of each individual krill from food and from the highest density of the herd.

Three vital actions considered to define the time-dependent position of an individual krill are one, movement induced by other krill individuals, two, foraging activity, and three, random diffusion.

Krill individuals attempt to maintain a high density and hence move due to their mutual effect. Local swarm density, target swarm density, and repulsive swarm density are used to estimate the direction of motion. Food location and prior experience about the food location are the two parameters used to estimate the foraging motion. Random diffusion is used for the exploration of the search space. In the KH algorithm, the population diversity is improved by means of the diffusion function, which is integrated into the krill individuals. Random diffusion is the net movement of each krill individual from high-density to low-density regions.

The motion velocity of a krill particle follows the Lagrangian model [43], as shown in Equation (5).

$$\frac{dX_i}{dt} = N_i + F_i + D_i \qquad (5)$$

In the above formula, $\frac{dX_i}{dt}$ is the motion velocity of krill particle $i$, $N_i$ is the induced motion, $F_i$ is the foraging motion, and $D_i$ is the random diffusion of the krill individual. The vital parameters that play an important role in the KH algorithm are outlined in Table 5.

The steps to select the optimal feature subset using KH are outlined below (Algorithm 2):

Input: training set
Process:
Step 1: initialize the population of krill herds (solutions) at random. Each solution is a binary vector whose length equals the number of features. If the corresponding feature is selected, it is represented as “1,” else as “0.” Initialize the KH parameters, including the maximum induced motion $N^{\max}$, the foraging speed $V_f$, and the maximum random diffusion speed $D^{\max}$.
Step 2: calculate the fitness value of each krill herd (solution) using the SVM classifier, where the accuracy of the SVM classifier is considered as the fitness function. The solution with the highest fitness value is considered as the global best solution.
Step 3: update the position of each krill using Equations (6) and (7) based on movement induced by other krill individuals, foraging activity, and random diffusion.
$$X_i(t + \Delta t) = X_i(t) + \Delta t \left(N_i + F_i + D_i\right) \qquad (6)$$
$$\Delta t = C_t \sum_{j=1}^{NV} \left(UB_j - LB_j\right) \qquad (7)$$
In the above formulas, $X_i(t)$ is the current position of the krill; $\Delta t$ is the scaling factor of the velocity vector; $N_i$ is the induced motion; $F_i$ is the foraging motion; $D_i$ is the random diffusion of the krill individual; $C_t$ is the step-length scaling factor; $NV$ is the total number of variables; $UB_j$ is the upper bound of variable $j$, and $LB_j$ is the lower bound of variable $j$.
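Step 4: each krill individual attempts to maintain a high density and changes its position due to the mutual effect. The direction of an individual krill is determined by the target effect, local effect, and repulsive effect. The movement induced by other krill individuals is calculated using Equations (8) and (9).
$$N_i^{\mathrm{new}} = N^{\max} \alpha_i + \omega_n N_i^{\mathrm{old}} \qquad (8)$$
$$\alpha_i = \alpha_i^{\mathrm{local}} + \alpha_i^{\mathrm{target}} \qquad (9)$$
In the above formulas, $N_i^{\mathrm{new}}$ is the induced motion; $N^{\max}$ is the maximum induction speed; $\alpha_i$ is the induced direction; $\omega_n$ is the inertia weight of the motion induced; $N_i^{\mathrm{old}}$ is the last induced motion; $\alpha_i^{\mathrm{local}}$ is estimated from the local effect, and $\alpha_i^{\mathrm{target}}$ is the target effect.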
Step 5: calculate the foraging motion using Equations (10) and (11). It is mainly based on the current location of the food and the previous experience about the food location.
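$$F_i = V_f \beta_i + \omega_f F_i^{\text{old}} \tag{10}$$
$$\beta_i = \beta_i^{\text{food}} + \beta_i^{\text{best}} \tag{11}$$
In the above formulas, $V_f$ is the maximum foraging speed; $F_i$ is the foraging motion; $\omega_f$ is the inertia weight of the foraging motion; $F_i^{\text{old}}$ is the last foraging motion; $\beta_i^{\text{food}}$ is the food attraction, and $\beta_i^{\text{best}}$ is the effect of the best fitness of the krill.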
Step 6: calculate the random diffusion motion using Equation (12), which is characterized by a maximum diffusion speed and a random directional vector.
$$D_i = D^{\max} \left(1 - \frac{I}{I_{\max}}\right) \delta \tag{12}$$
In the above formula, $D^{\max}$ is the maximum random diffusion speed; $\delta$ is the random directional vector; $I$ is the current iteration number, and $I_{\max}$ is the maximum number of iterations.
Step 7: repeat steps 2 to 6 until the maximum number of iterations is reached or the solution converges. The solution with the maximum fitness value obtained by the classifier is considered as the optimal feature subset.
Output: optimal feature subset.
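The KH steps above can be sketched in a few lines of Python. This is a simplified illustration rather than the implementation used in this work: the induced direction keeps only the target effect, the food position is taken as a fitness-weighted centre, continuous positions in [0, 1] are thresholded at 0.5 to obtain the binary mask, and all parameter values are placeholders. The names `binarize`, `kh_step`, and `select_features` are hypothetical, and `fitness` stands in for the SVM accuracy of a feature mask.

```python
import math
import random


def binarize(position, threshold=0.5):
    """Map a continuous position to a 0/1 feature mask (assumed rule)."""
    return [1 if x > threshold else 0 for x in position]


def unit(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def kh_step(positions, induced, foraging, fit, it, max_it, rng,
            n_max=0.01, v_f=0.02, d_max=0.005, w_n=0.5, w_f=0.5, c_t=0.5):
    """Move every krill once, in place; positions are clipped to [0, 1]."""
    n_var = len(positions[0])
    best = list(positions[max(range(len(fit)), key=fit.__getitem__)])
    total = sum(fit)
    weights = [f / total for f in fit] if total > 0 else [1.0 / len(fit)] * len(fit)
    # Assumed food position: fitness-weighted centre of the herd.
    food = [sum(w * p[j] for w, p in zip(weights, positions)) for j in range(n_var)]
    dt = c_t * n_var  # Equation (7) with UB_j - LB_j = 1 for every variable
    for i, x in enumerate(positions):
        alpha = unit([b - xi for b, xi in zip(best, x)])   # target effect only
        beta = unit([f - xi for f, xi in zip(food, x)])    # food attraction only
        delta = [rng.uniform(-1.0, 1.0) for _ in range(n_var)]
        for j in range(n_var):
            induced[i][j] = n_max * alpha[j] + w_n * induced[i][j]    # Equation (8)
            foraging[i][j] = v_f * beta[j] + w_f * foraging[i][j]     # Equation (10)
            diffusion = d_max * (1.0 - it / max_it) * delta[j]        # Equation (12)
            step = dt * (induced[i][j] + foraging[i][j] + diffusion)  # Equation (6)
            x[j] = min(1.0, max(0.0, x[j] + step))


def select_features(fitness, n_features, n_krill=10, max_it=30, seed=0):
    """Return the best mask seen and its fitness (higher is better)."""
    rng = random.Random(seed)
    positions = [[rng.random() for _ in range(n_features)] for _ in range(n_krill)]
    induced = [[0.0] * n_features for _ in range(n_krill)]
    foraging = [[0.0] * n_features for _ in range(n_krill)]
    best_mask, best_fit = None, -1.0
    for it in range(1, max_it + 1):
        masks = [binarize(p) for p in positions]
        fit = [fitness(m) for m in masks]
        top = max(range(n_krill), key=fit.__getitem__)
        if fit[top] > best_fit:
            best_mask, best_fit = masks[top], fit[top]
        kh_step(positions, induced, foraging, fit, it, max_it, rng)
    return best_mask, best_fit
```

In a wrapper setting, `fitness` would train and score an SVM on the columns selected by the mask; a mask with no selected features would need special handling there.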
5.2.3. Outline of the BFO Algorithm for Feature Selection

The bacterial foraging optimization (BFO) algorithm imitates the pattern exhibited during the foraging process of Escherichia coli bacteria, which includes chemotaxis, swarming, reproduction, and elimination-dispersal operations [41]. The basic idea behind the foraging strategy of E. coli bacteria is to obtain the maximum nutrition in a unit time. The chemotaxis strategy involves searching for nutrition by taking small movements such as tumbling, moving, and swimming, using locomotory organs called flagella. The swarming strategy deals with the communication between bacteria. When the bacteria discover a high amount of nutrients, they release chemical substances to attract other bacteria. If they are in danger, they tend to repel other bacteria. The reproduction process involves splitting each of the healthier bacteria into two, while the least healthy bacteria die. Finally, the elimination-dispersal strategy involves replacing the low-health bacteria with randomly generated new ones. The vital parameters that play an important role in the BFO algorithm are outlined in Table 6.

The steps involved in finding the optimal feature subset using the BFO algorithm are outlined below:

Input: training set
Process:
Step 1: initialize the population of S bacteria (solutions) at random. Each solution is a binary vector whose length equals the number of features. If the corresponding feature is selected, it is represented as “1,” else as “0.” Initialize the control parameters of the BFO algorithm, including the parameter that is defined for each bacterium (its subscript ranges over 1, 2, …, S).
Step 2: calculate the fitness value of each bacterium (solution) using the SVM classifier, where the accuracy of the SVM classifier is considered as the fitness function.
Step 3: in the elimination-dispersal process, bacteria are eliminated or dispersed from their current location due to environmental changes. This process strengthens the ability of global optimization. Initiate the elimination-dispersal loop and increment its counter from 0 up to the number of elimination-dispersal steps.
Step 4: in the reproduction process, the least healthy bacteria die, and each of the remaining healthier bacteria is divided into two. The new bacteria are placed at the same position as their parent. This process keeps the size of the bacterial population constant. Initiate the reproduction loop and increment its counter from 0 up to the number of reproduction steps.
Step 5: in the chemotactic process, the E. coli bacterium performs two actions throughout its lifetime, namely, tumble and swim. Initiate the chemotactic loop and increment its counter from 0 up to the number of chemotactic steps.
Step 6: execute the chemotactic process. Each bacterium undergoes a chemotactic step as follows:
Step 6a: calculate the objective function for each bacterium using Equation (13).
(13)
Assign the value of objective function in to .
Step 6b: perform tumbling action for each bacterium using Equation (14). This action will enable the bacteria to change the present direction for a period of time.
(14)
In the above formula, is the maximum number of iterations, and is a random forward direction of movement.
Step 6c: based on the tumbled direction obtained, each bacterium moves to a new position using Equation (15).
(15)
In the above formula, is the bacterium position in the chemotaxis, reproduction, and elimination-dispersal procedure, and means the bacterium position in the chemotaxis, reproduction, and elimination-dispersal procedure.
Step 6d: compute the objective function value using Equation (16).
(16)
Step 6e: perform swim action and assign the value of swim length .
Step 6f: when the number of steps in the swim process is greater than the swim length (m), then increase the value to 1 If the value of replace the value using the current best objective value , then assign the swim length .
Step 7: if then go to step 5. Else, go to step 8.
Step 8: execute the reproduction process. In this process, the accumulated cost of each bacterium is calculated using Equation (17).
(17)
The accumulated cost of bacterium represents the health of the bacterium. Bacteria will be sorted in descending order based on the value. If the accumulated cost of bacterium is high, it means that the bacterium did not get enough nutrition or food during its entire lifetime. They are considered to have low health and set to die. The remaining healthy bacteria are divided into two. The reproduced bacteria are positioned at the same place as their parents.
Step 8a: if the specified number of reproduction steps has not been reached, go to step 4.
Step 9: execute the elimination-dispersal process. Each bacterium is eliminated according to the elimination probability, and the process keeps the number of bacteria in the population unchanged. If a bacterium is eliminated, a random search is initialized to move it to a new position so as to avoid local optima, after a certain number of reproduction steps.
Step 10: repeat steps 3 to 9 until the number of elimination-dispersal steps exceeds its specified maximum; then terminate the process.
Output: optimal feature subset.
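The chemotaxis and reproduction steps above can be illustrated with the following Python sketch. It follows the commonly described form of BFO (a unit-length random tumble direction, then swimming while the cost keeps decreasing) and is not necessarily identical to Equations (13) to (17). The function names, the `step_size` and `swim_length` parameters, and the treatment of the cost as a quantity to minimise (for example, one minus the SVM accuracy of the thresholded position) are assumptions made for illustration.

```python
import math
import random


def chemotaxis_step(theta, cost, step_size, swim_length, rng):
    """Tumble once, then swim in the same direction while the cost improves."""
    j_last = cost(theta)
    # Tumble: draw a random direction and scale it to unit length.
    delta = [rng.uniform(-1.0, 1.0) for _ in theta]
    norm = math.sqrt(sum(d * d for d in delta))
    direction = [d / norm for d in delta] if norm > 0 else delta
    # Move one step along the tumbled direction.
    theta = [t + step_size * d for t, d in zip(theta, direction)]
    j_new = cost(theta)
    swims = 0
    # Swim: keep moving while the cost decreases, at most swim_length times.
    while swims < swim_length and j_new < j_last:
        j_last = j_new
        theta = [t + step_size * d for t, d in zip(theta, direction)]
        j_new = cost(theta)
        swims += 1
    return theta, j_new


def reproduce(population, health):
    """Keep the healthier half (lowest accumulated cost) and duplicate it.

    Assumes an even population size so that the size stays unchanged.
    """
    order = sorted(range(len(population)), key=health.__getitem__)
    survivors = [list(population[i]) for i in order[: len(population) // 2]]
    return survivors + [list(s) for s in survivors]
```

The accumulated cost passed to `reproduce` would be the sum of the costs a bacterium recorded over its chemotactic steps, so that a high value marks a bacterium that found little nutrition.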
5.3. Classifier Training

Each clinical dataset is preprocessed and split into a training set (60%) and a testing set (40%). A wrapper approach that uses three bioinspired algorithms, CSO, KH, and BFO, with the classification accuracy of SVM as the fitness function has been used for feature selection. The features selected by each bioinspired algorithm are used to train three BPNNs independently using CGA. Each BPNN has one hidden layer, and the activation function used in the hidden layer is sigmoid. The learning rate is 1e-07, and the maximum number of iterations is 100. Since the classification is binary, each BPNN has only one output node, and the activation function used in the output layer is sigmoid. Figure 2 elaborates the process of training the BPNN classifiers.

The number of training instances for the FCSO, FKH, and FBFO classifiers is presented in Table 3. Although the majority of the features selected by each bioinspired algorithm overlap, it has been inferred that the number of features selected by each algorithm is not the same. The parameter settings for each classifier are presented in Table 7.

The steps to train the three BPNN classifiers on the features selected by the CSO, KH, and BFO algorithms are outlined below:

Input: training set (FCSO, FKH, FBFO).
Step 1: initialize the parameters, namely, weights and bias, number of hidden layers, and learning rate of the BPNN.
Step 2: the number of hidden nodes is calculated using Equation (18).
(18)
In the above formula, is the number of hidden nodes, and is the number of input nodes.
Step 3: the input of the hidden layer is calculated using Equation (19).
(19)
In the above formula, is the input of the hidden layer; is the weights of each input nodes; is the bias.
Step 4: the output of the hidden layer is calculated using Equation (20).
(20)
where is the output of the hidden layer, and is the input to the neuron from the previous layer.
Step 5: calculate the error rate in the predicted output using Equation (21).
(21)
In the above formula, is the expected output, and is the obtained output.
Step 6: update the weights and bias based on the learning rate and error rate using CGA.
Step 7: repeat steps 3 to 6 until the error rate converges or the maximum number of iterations is reached.
Output: three BPNN classifiers trained using FCSO, FKH, and FBFO.
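The training loop above can be sketched as a small one-hidden-layer network in plain Python. Two points differ from the method described and are assumptions of this sketch: the weights are updated by ordinary gradient descent as a simpler stand-in for CGA, and the error is taken as half the squared difference between expected and obtained output. The number of hidden nodes is passed in directly rather than derived from Equation (18), and the function names are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Logistic activation used in both the hidden and the output layer."""
    return 1.0 / (1.0 + math.exp(-z))


def init_network(n_inputs, n_hidden, rng):
    """Small random weights; the last entry of every weight list is the bias."""
    hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
              for _ in range(n_hidden)]
    output = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    return hidden, output


def forward(network, x):
    """Return the hidden activations and the single sigmoid output."""
    hidden, output = network
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + row[-1]) for row in hidden]
    y = sigmoid(sum(w * hi for w, hi in zip(output, h)) + output[-1])
    return h, y


def train_epoch(network, data, lr):
    """One pass of gradient descent (a stand-in for CGA); returns total error."""
    hidden, output = network
    total = 0.0
    for x, target in data:
        h, y = forward(network, x)
        err = target - y
        total += 0.5 * err * err
        d_out = err * y * (1.0 - y)
        # Hidden deltas use the output weights before they are updated.
        d_hid = [d_out * output[k] * h[k] * (1.0 - h[k]) for k in range(len(h))]
        for k in range(len(h)):
            output[k] += lr * d_out * h[k]
        output[-1] += lr * d_out
        for k, row in enumerate(hidden):
            for i, xi in enumerate(x):
                row[i] += lr * d_hid[k] * xi
            row[-1] += lr * d_hid[k]
    return total
```

A prediction is obtained by thresholding the output of `forward` at 0.5; the learning rate used in such a toy run would be far larger than the value listed for the classifiers in this work.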
5.4. Classifier Testing and Dataset Construction for Super Learner

After training the classifiers with 60% of the preprocessed dataset, classifier testing is performed using the remaining 40% of the preprocessed dataset. Figure 3 elaborates the process of testing the three classifiers and also throws light on the process of training the super learner.

Feature selection is performed on the testing set by querying the FCSO, FKH, and FBFO databases. The instances of the testing set containing the features selected by CSO are used to test the FCSO classifier; similarly, the instances of the testing set containing the features selected by KH and BFO are used to test the FKH and FBFO classifiers, respectively. The performance of the FCSO, FKH, and FBFO classifiers is evaluated using the results obtained from the testing set.

The classification result of each instance of the testing set for FCSO, FKH, and FBFO classifiers and the class label corresponding to each instance of the testing set will be the candidate instances for training and testing the super learner.
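The construction of the candidate instances can be sketched as follows. Each candidate instance pairs the three classification results with the true class label. Whether the classification result is the predicted class (0 or 1) or the raw sigmoid output is not stated here, so the sketch accepts either; the function name `build_meta_dataset` is hypothetical.

```python
def build_meta_dataset(pred_fcso, pred_fkh, pred_fbfo, labels):
    """Pair the three classifier results of each test instance with its label."""
    if not (len(pred_fcso) == len(pred_fkh) == len(pred_fbfo) == len(labels)):
        raise ValueError("all inputs must cover the same testing instances")
    # One candidate instance per testing instance: (three results, class label).
    return [((a, b, c), y)
            for a, b, c, y in zip(pred_fcso, pred_fkh, pred_fbfo, labels)]
```

For example, results `[1, 0]`, `[1, 1]`, and `[0, 0]` with labels `[1, 0]` yield the two instances `((1, 1, 0), 1)` and `((0, 1, 0), 0)`.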

5.5. Super Learner Training and Testing

As outlined in Section 5.4, the classification result pertaining to each instance of the testing set for the FCSO, FKH, and FBFO classifiers and the class label corresponding to each instance of the testing set will be the candidate instances for training and testing the super learner. Figure 4 elaborates the process of training and testing the super learner. The training set comprises 80% of the instances, and the testing set comprises 20% of the instances. The number of training and testing instances for the super learner is presented in Table 3.
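A minimal sketch of the 80/20 split of the candidate instances is given below. Whether the instances are shuffled or stratified by class before splitting is not specified here; the sketch assumes a seeded random shuffle, and the function name `split_instances` is hypothetical.

```python
import random


def split_instances(instances, train_fraction=0.8, seed=0):
    """Shuffle a copy of the instances and cut it into training and testing parts."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]
```

With 10 candidate instances this gives 8 for training and 2 for testing the super learner.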

Super learner is a type of ensemble classifier [44]. In this work, a BPNN classifier trained using CGA is used as the super learner. The parameter settings for the super learner are presented in Table 8.

The super learner is trained using the steps presented in Section 5.3 for training the BPNN classifier using CGA, and the performance of the super learner is evaluated using the testing set.

6. Results and Discussions

Seven clinical datasets from the UCI ML repository, namely, WDBC, SHD, HCC, HD, VCD, CHD, and ILP have been used for experimentation. The performance of the FCSO, FKH, and FBFO classifiers and the super learner is evaluated in terms of accuracy, sensitivity, specificity, precision, and F-score, which are calculated based on true positive (TP), true negative (TN), false positive (FP), and false negative (FN) using Equations (22), (23), (24), (25), and (26).
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{22}$$
$$\text{Sensitivity} = \frac{TP}{TP + FN} \tag{23}$$
$$\text{Specificity} = \frac{TN}{TN + FP} \tag{24}$$
$$\text{Precision} = \frac{TP}{TP + FP} \tag{25}$$
$$F\text{-score} = \frac{2 \times \text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}} \tag{26}$$
In the above formulas, TP is the number of positive instances predicted as positive by the classifier, TN is the number of negative instances predicted as negative by the classifier, FP is the number of negative instances predicted as positive by the classifier, and FN is the number of positive instances predicted as negative by the classifier.
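As a worked illustration, the five measures can be computed from the four counts using their standard definitions. The function name `evaluate` and the convention of returning 0.0 when a denominator is zero are assumptions of this sketch.

```python
def evaluate(tp, tn, fp, fn):
    """Compute the five performance measures from confusion-matrix counts."""
    def ratio(num, den):
        # Assumed convention: an undefined ratio is reported as 0.0.
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)
    precision = ratio(tp, tp + fp)
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": ratio(tn, tn + fp),
        "precision": precision,
        "f_score": ratio(2 * precision * sensitivity, precision + sensitivity),
    }
```

For hypothetical counts TP = 40, TN = 45, FP = 5, and FN = 10, the accuracy is 85/100 = 0.85, the sensitivity is 40/50 = 0.80, and the specificity is 45/50 = 0.90.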

Accuracy, sensitivity, specificity, precision, and F-score obtained using the FCSO, FKH, and FBFO classifiers and the super learner for the datasets WDBC, SHD, HCC, HD, VCD, CHD, and ILP are presented in Tables 9 to 15.

The super learner has achieved a classification accuracy of 96.83% for WDBC, 86.36% for SHD, 94.74% for HCC, 90.48% for HD, 81.82% for VCD, 84.0% for CHD, and 70.0% for ILP. The classification accuracy of the proposed work has been compared with the performance of existing work on clinical datasets, and the comparison results are summarized in Table 16.

7. Conclusion and Scope for Future Work

A CAD system that employs a super learner to diagnose the presence or absence of a disease has been implemented in this work. Seven clinical datasets from the UCI ML repository, namely, WDBC, SHD, HCC, HD, VCD, CHD, and ILP have been used for experimentation. Each dataset is preprocessed, and the preprocessed dataset is split into training and testing sets. A wrapper-based feature selection approach using three bioinspired algorithms, namely, CSO, KH, and BFO, with the accuracy of the SVM classifier as the fitness function has been used to select the optimal feature subsets. The selected feature subsets are used to train three BPNN classifiers using CGA, and the performance of the trained classifiers is evaluated. The classification results obtained for each instance of the testing set of the three classifiers and the class label associated with each instance of the testing set are the candidate instances for training and testing the super learner. The super learner achieved a classification accuracy of 96.83% for WDBC, 86.36% for SHD, 94.74% for HCC, 90.48% for HD, 81.82% for VCD, 84.0% for CHD, and 70.0% for ILP.

CAD systems to diagnose disorders in the human body from different imaging modalities such as X-ray, computed tomography, magnetic resonance imaging, and positron emission tomography are gaining importance. This work can be extended by developing CAD systems to diagnose disorders from the medical images acquired through different imaging modalities. Features based on shape, texture, and run length can be extracted from the images, and the feature selection algorithms used in this work can be used to select the relevant features. The relevant features can be used to build classifier models to predict the presence or absence of disorders from the images.

Data Availability

The data supporting this study are from previously reported studies and datasets, which have been cited. The datasets used in this research work are available at the UCI Machine Learning Repository.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.