Introduction

In machine learning problems, high-dimensional data, especially data with many features, are increasingly common these days [1]. Many researchers have focused on experiments to solve these problems and to extract the important features from such high-dimensional collections of variables. Statistical techniques are used to minimize noise and redundant data. Nevertheless, we do not need to use all the features to train a model; we may improve our model by using only features that are relevant and non-redundant, so feature selection plays an important role.

Moreover, feature selection not only helps train our model faster but also lowers the complexity of the model, makes it easier to understand, and improves performance metrics such as accuracy, precision, or recall. There are four important reasons why feature selection is essential: it simplifies the model by reducing the number of parameters, it decreases the training time, it reduces overfitting by enhancing generalization, and it avoids the curse of dimensionality. In the field of data processing and analysis, a dataset may contain a large number of variables or attributes, which determines the applicability and usability of the data [2]. A further challenge in classification is to pay attention to balanced and imbalanced data [3]. Another motivation is to obtain the best model with high predictive power and small errors [4, 5].

The reduction of the original feature set to a smaller one that preserves the relevant information while discarding the redundant part is referred to as feature selection (FS) [6, 7]. To solve this issue, we have to use a smaller number of training samples, and the use of feature selection and extraction techniques is the highlight of this case. Feature selection methods are often used to increase the generalization potential of a classifier [8, 9]. In this paper, we compare the results on each dataset with and without important-feature selection by the RF methods varImp(), Boruta, and RFE to obtain the best accuracy. At the heart of machine learning, large amounts of data, features, and variables are required to make predictions and reach high accuracy; more than that, selecting the features is more important than designing the prediction model. Furthermore, using a dataset without pre-processing will only make the prediction results worse.

Related to previous research, [10] performed feature-importance analysis in classification models for colorectal cancer phenotypes in Indonesia; these features can also serve as covariates in future genetic association studies of colorectal cancer. The work in [11] conducted feature-importance analysis for emotion classification and emotional speech synthesis. Also, [12, 13] performed feature-importance analysis for industrial recommendation systems with promising results. In this paper, we show how significant feature selection is on the Bank Marketing dataset, the Car Evaluation dataset, and the Human Activity Recognition Using Smartphones dataset.

The main contributions of this research are summarized as follows. First, it analyses various features to find out which ones are useful, particularly for classification data analysis. These studies have been implemented with Random Forest, and some discussion is presented to clarify the selection of the critical metric. Second, the system compares different machine learning models, namely RF, Support Vector Machine (SVM), K-Nearest Neighbor (KNN), and Linear Discriminant Analysis (LDA), based on the critical features. Different models have different strengths in data classification, which affects classification performance. Besides, we use multiple feature selection methods, RF varImp(), Boruta, and RFE, to obtain the best accuracy. Further, we review feature selection applications and provide a description, analysis, and suggestions for future research.

The remainder of the paper is organized as follows. “Material and method” section provides a review of the Materials and methods. “Results and discussion” section presents our results and discussion. Finally, conclusions and future research directions are indicated in “Conclusion and future work” section.

Material and method

Important features study

Variable importance analysis with RF has received a lot of attention from many researchers, but some open issues still lack a satisfactory answer. For instance, Andy Liaw and Matthew Wiener applied RF to classification and regression problems using the R language [14]. Other research combines RF and KNN on the HAR dataset using Caret [15]. Moreover, [16] introduced RF methods to Diabetic Retinopathy (DR) classification analyses; those results suggest that RF methods could be a valuable tool to diagnose DR and evaluate its progression. Grömping [17] compares the two approaches (linear model and random forest) and finds both striking similarities and differences, some of which can be explained whereas others remain a challenge; the investigation improves understanding of the nature of variable importance in RF. RF has been discussed as a robust learner in several domains [18, 19]. Feature selection aims at finding the most relevant features of a problem domain and is beneficial in improving computational speed and prediction accuracy [20]. In [21], a comparative analysis on the Human Activity Recognition (HAR) dataset, based on machine learning methods with different characteristics, is conducted to select the best classifier among the models; this study showed that the RF approach has high precision in each category and is considered the best classifier [22]. Further, the combination of RF, SVM (Support Vector Machine), and tuned SVM regression to improve model performance can be found in [23]. The experiments there describe how essential the best features are for improving model performance [24]. Feature selection is useful across disciplines, for instance in ecology, climate, health, and finance. Table 1 describes applications of feature selection in detail.

Table 1 Description application of feature selection

The evaluation of variable and feature importance depends on whether or not the metric uses model information. The advantage of a model-based approach is that it is more closely tied to model performance and may be able to incorporate the correlation structure between the predictors into the importance calculation. In brief, the importance is calculated as follows: each predictor obtains a separate variable importance for each class, and all importance measurements are scaled to have a maximum value of 100, unless the scale argument of varImp() is set to FALSE.

In this experiment, the model-specific metrics of Random Forest from the R package were used. For each tree, the prediction accuracy on the out-of-bag portion of the data is recorded. Then the same is done after permuting each predictor variable. The difference between the two accuracies is averaged over all trees and normalized by the standard error. We use the train() function from the caret package to fit the desired model and then the varImp() function to determine feature importance by RF.
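A minimal sketch of this workflow is shown below, assuming a placeholder data frame `dat` with outcome column `y`; the column names and tuning choices are illustrative only.

```r
library(caret)
set.seed(123)

# Fit an RF model with 10-fold cross-validation; `dat` and `y` are placeholders.
ctrl   <- trainControl(method = "cv", number = 10)
fit.rf <- train(y ~ ., data = dat, method = "rf",
                metric = "Accuracy", trControl = ctrl)

# Model-specific importance; values are scaled to a maximum of 100 by default.
imp <- varImp(fit.rf, scale = TRUE)
print(imp)
plot(imp, top = 10)   # plot the ten most important variables
```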

Recursive Feature Elimination (RFE) offers an accurate way to define the prominent variables before they are fed into a machine learning algorithm. Guyon et al. [74] proposed RFE and applied it to cancer classification using SVM. RFE employs all features to build an SVM model, then ranks the contribution of each feature in the SVM model into a ranked feature list, and finally eliminates the unrelated features whose contribution to the SVM model is negligible. Moreover, RFE is a powerful algorithm for feature selection, which depends on the specific learning model [75, 76].
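The sketch below shows how RFE can be run with caret's rfe() function; it uses the RF-based ranking functions (rfFuncs) applied later in this paper, and `x_features` and `y_class` are placeholder names for the feature matrix and class labels.

```r
library(caret)
set.seed(123)

# RFE with random-forest ranking functions and 10-fold cross-validation.
rfe_ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 10)
rfe_fit  <- rfe(x = x_features, y = y_class,
                sizes = c(2, 4, 6, 8),   # candidate subset sizes to evaluate
                rfeControl = rfe_ctrl)

print(rfe_fit)
predictors(rfe_fit)   # the feature subset retained by RFE
```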

Boruta is a feature selection and feature ranking algorithm based on the RF algorithm. Boruta's benefits are deciding the significance of a variable and assisting the statistical selection of important variables. Besides, we can manage the strictness of the algorithm by adjusting the p value, which defaults to 0.01. maxRuns is the number of times the algorithm is run; the higher the maxRuns, the more selective we become in choosing the variables, and its default value is 100. For the confirmation of feature selection, our experiment followed the Boruta package in the R programming language [77]. This package is based on a wrapper built around the RF classification algorithm and works on the RF method to determine significant features. It tries to capture all the interesting and important features in each dataset with respect to an outcome variable. The algorithm performs a top-down search for relevant features by comparing the importance of the original attributes against randomly permuted copies of them.
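A short sketch of a Boruta run is given below, assuming a placeholder data frame `dat` with outcome column `y`; the TentativeRoughFix() step used later in this paper resolves attributes still marked as tentative.

```r
library(Boruta)
set.seed(123)

# Run Boruta with its default strictness (p value 0.01) and up to 100 iterations.
boruta_output <- Boruta(y ~ ., data = dat, pValue = 0.01, maxRuns = 100)
print(boruta_output)

# Resolve attributes left as "Tentative" after maxRuns iterations.
final_boruta <- TentativeRoughFix(boruta_output)
getSelectedAttributes(final_boruta, withTentative = FALSE)
```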

Classifiers method

Random Forests (RF) consist of a combination of decision trees. RF improves the classification performance of a single tree classifier by combining the bootstrap aggregating method with randomization in the selection of data nodes during the construction of a decision tree [78]. A decision tree with M leaves divides the feature space into M regions Rm, 1 ≤ m ≤ M. For each tree, the prediction function f(x) is defined as:

$$f(x) = \sum_{m=1}^{M} c_{m}\, \Pi(x, R_{m})$$
(1)

where M is the number of regions in the feature space, Rm is the region corresponding to m, and cm is a constant associated with m:

$$\Pi(x, R_{m}) = \begin{cases} 1, & \text{if } x \in R_{m} \\ 0, & \text{otherwise} \end{cases}$$
(2)

The final classification decision is made by the majority vote of all trees.

K-Nearest Neighbor (KNN) [79, 80] works on the assumption that the instances of each class are surrounded mostly by instances from the same class. Therefore, given a set of training instances in the feature space and a scalar k, an unlabelled instance is classified by assigning the label that is most frequent among the k training samples nearest to that instance. Among the many measures used for the distance between instances, the Euclidean distance is the most frequently used for this purpose [81]. Some previous research on KNN can be found in [82,83,84]. The distance metric used in this method is the Euclidean distance, described in the equation below:

$$L(x_{i}, x_{j}) = \left( \sum_{k=1}^{n} \left| x_{ik} - x_{jk} \right|^{2} \right)^{\frac{1}{2}}, \quad x \in \mathbb{R}^{n}$$
(3)
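As a toy illustration of this rule, the sketch below computes Euclidean distances and a k-nearest-neighbour prediction on the built-in iris data; the choice of k = 9 and the 80/20 split are arbitrary here.

```r
library(class)
set.seed(123)

# 80/20 split of iris; scale the test set with the training-set parameters.
train_idx <- sample(seq_len(nrow(iris)), size = 0.8 * nrow(iris))
train_x <- scale(iris[train_idx, 1:4])
test_x  <- scale(iris[-train_idx, 1:4],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))

# Euclidean distance (Eq. 3) between the first test instance and every training instance.
d <- sqrt(rowSums(sweep(train_x, 2, test_x[1, ])^2))
head(sort(d))

# Classify by majority vote among the k nearest training samples.
pred <- knn(train = train_x, test = test_x, cl = iris$Species[train_idx], k = 9)
table(predicted = pred, actual = iris$Species[-train_idx])
```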

Linear Discriminant Analysis (LDA) [85] is usually used as a dimensionality reduction technique in the pre-processing step for classification and machine learning applications. The goal is to project a dataset into a lower-dimensional space with good class separability, in order to avoid over-fitting and to reduce computational costs. LDA is usually used to discover a linear combination of features or variables, and this combination is beneficial for dimensionality reduction. LDA yields well-separated classes from the fixed dataset because the distances between the training data within a class are made shorter [86]. The purpose of LDA is to maximize the between-class measure while minimizing the within-class measure. Let Ci be the class containing the state binary vectors x corresponding to the ith activity class. The linear discriminant features are then obtained by solving the generalized eigenvalue problem:

$$L = \mathrm{Eig}\left( S_{W}^{-1} S_{B} \right)$$
(4)

where the between-class scatter matrix \(S_{B}\) and the within-class scatter matrix \(S_{W}\) are calculated as in [87]. The number of reduced variables is at most N − 1 because there are only N points to estimate \(S_{B}\).
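The sketch below works through Eq. (4) on the iris data, building the scatter matrices explicitly; it is an illustrative calculation, not the paper's pipeline.

```r
# Construct the within- and between-class scatter matrices and solve Eq. (4).
X <- as.matrix(iris[, 1:4])
y <- iris$Species
overall_mean <- colMeans(X)

p  <- ncol(X)
Sw <- matrix(0, p, p)   # within-class scatter
Sb <- matrix(0, p, p)   # between-class scatter
for (cl in levels(y)) {
  Xc <- X[y == cl, , drop = FALSE]
  mc <- colMeans(Xc)
  Sw <- Sw + t(sweep(Xc, 2, mc)) %*% sweep(Xc, 2, mc)
  d  <- matrix(mc - overall_mean, ncol = 1)
  Sb <- Sb + nrow(Xc) * (d %*% t(d))
}

# Generalized eigenvalue problem; with N = 3 classes at most N - 1 = 2 useful directions remain.
L <- eigen(solve(Sw) %*% Sb)
Re(L$values)
```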

Support Vector Machines (SVM) is a machine learning algorithm. In recent years, plenty of research has presented SVM as a powerful method for classification; an overview can be found in [88,89,90,91], and SVM can also be used for regression [30, 92]. Other research describes that SVM uses a high-dimensional space to find a hyperplane that performs binary classification with a minimal error rate [93, 94]. The problem for SVM is to separate the two classes with a function obtained from the available training data [36, 95, 96], with the aim of producing a classifier that also works well on other problems. The hyperplane function in SVM maximally separates the input vectors into two regions. SVM is not limited to separating two kinds of objects, and there are several alternative dividing lines that arrange the set of objects into two classes. This technique seeks an optimal classifier function that can separate two sets of data from two different categories; in this case, the separating function sought is linear.

$$g(x) = \operatorname{sign}\left( f(x) \right)$$
(5)

with \(f\left( x \right) = \varvec{w}^{T} \varvec{x} + b\), \(\varvec{w},\varvec{x} \in \varvec{R}^{n}\) and b ∈ \(\varvec{R}\), where w and b are the parameters whose values are sought. The best hyperplane is located in the middle between the two sets of objects from the two classes, so finding the best hyperplane is equivalent to maximizing the margin, or distance, between the two sets of objects from the two categories. Samples located along the margin of the hyperplane are called support vectors. The technique thus attempts to find the best classifier/hyperplane function among candidate functions.
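As a small demonstration of the hyperplane parameters w and b, the sketch below fits a linear SVM on a two-class subset of iris with the e1071 package and reconstructs the weight vector and intercept; the variables and cost value are chosen only for illustration.

```r
library(e1071)
set.seed(123)

# Two-class toy problem with a linear kernel so f(x) = w'x + b can be read off.
dat <- droplevels(subset(iris, Species != "virginica"))
fit <- svm(Species ~ Petal.Length + Petal.Width, data = dat,
           kernel = "linear", cost = 1, scale = FALSE)

w <- t(fit$coefs) %*% fit$SV   # weight vector of the separating hyperplane
b <- -fit$rho                  # intercept
w
b
nrow(fit$SV)                   # number of support vectors lying along the margin
```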

Classification and Regression Training (Caret) Package

The Caret package provides several functions that streamline the model building and evaluation process. It draws on around 30 other packages and contains functions that shorten the model training process for classification and complex regression problems. Caret loads packages as needed and assumes that they are installed; if a modelling package is missing, the user is prompted to install it. The package accommodates tools for data splitting, pre-processing, feature selection, model tuning using resampling, and variable importance estimation, as well as other functionality [97, 98]. A classification tree algorithm is a nonparametric approach: it is a classification method that does not depend on particular assumptions and is able to explore complex data structures with many variables, and the resulting structure can be inspected visually [99]. Moreover, the classification tree algorithm also makes the results easy to interpret.
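The snippet below illustrates caret's data splitting and pre-processing helpers on the iris data; the 80/20 proportion mirrors the split used later in this paper, while the dataset itself is only an example.

```r
library(caret)
set.seed(123)

# Stratified 80/20 split of the outcome classes.
idx      <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
training <- iris[idx, ]
testing  <- iris[-idx, ]

# Centre and scale the predictors using parameters estimated on the training set only.
pp              <- preProcess(training[, 1:4], method = c("center", "scale"))
training[, 1:4] <- predict(pp, training[, 1:4])
testing[, 1:4]  <- predict(pp, testing[, 1:4])
```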

Random Forest is divided into two kinds of trees, regression trees and classification trees. When an RF is used for classification, it is more accurate to call it a classification tree; when it is used for regression, it is known as a regression tree. In a classification tree the response variable is categorical data, whereas in a regression tree the response variable is continuous data. Classification trees are rules for predicting the class of an object from the values of predictor variables. Trees are formed through repeated data splitting, in which the levels and values of the predictor variables of each observation in the sample data are known. Each partition (split) of the data is expressed as a node in the tree that is formed.

Research workflow

Figure 1 describes the workflow of this research. The experiment consists of several steps. First, the datasets are collected from the University of California Irvine (UCI) machine learning repository; this work uses three popular datasets (Bank Marketing, Car Evaluation Database, Human Activity Recognition Using Smartphones) to conduct the experiment. Second, our work applies the feature selection methods RF, Boruta, and RFE to select essential features. The next step is the comparison of different machine learning models, namely RF, SVM, KNN, and LDA, for classification analysis. Determining an optimal subset of features from a feature set is a combinatorial problem that cannot be solved when the dimension is high without introducing specific assumptions or compromises that yield only approximate solutions; here our experiment uses a recursive approach to address the issue. Different models have different strengths in classification data analysis, so we compare four classifier methods with various features to select the best classifier based on the accuracy of each one. The whole work has been done in R [97, 98], a free software programming language developed specifically for statistical computing and graphics.

Fig. 1
figure 1

The workflow of this research

Model performance evaluation

The performance is evaluated based on the calculation of accuracy. Accuracy is how often the trained model is correct, which is depicted using the confusion matrix. A confusion matrix is a summary of prediction results on a classification problem [100]. A classification system is expected to classify all data correctly, but its performance is not entirely free of error. The error takes the form of classifying new objects into the wrong class (misclassification). The confusion matrix is a table recording the results of the classification work.

The confusion matrix in Table 2 has the following four outcomes [101]. True positive is the condition when observations coming from the positive class are predicted to be positive. False negative is the condition when an observation actually comes from the positive class but is predicted to be negative. False positive is the condition when an observation actually comes from the negative class but is predicted to be positive. Lastly, true negative is the condition when observations from the negative class are predicted to be negative. Classification performance can be judged by precision and recall. Recall, or the true positive rate, is the level of accuracy of predictions in the positive class: the percentage of positive observations that are predicted correctly. Moreover, accuracy is the percentage of all predictions that are correct over all observations in the data group. Apart from examining the confusion matrix, the quality of a classifier's predictions can be assessed from the Receiver Operating Characteristic (ROC) [102, 103] and Area Under the Curve (AUC) [104].

Table 2 Confusion Matrix

Based on the contents of the confusion matrix, the amount of data from each class that is predicted correctly and classified incorrectly can be seen. The accuracy and prediction error rates are then calculated using the equations below [105]:

$$\text{Accuracy} = (TP + TN)/(TP + TN + FP + FN)$$
(6)
$$\text{Precision} = TP/(TP + FP)$$
(7)
$$\text{Recall} = TP/(TP + FN)$$
(8)

where: TP = True positive; FP = False positive; TN = True negative; FN = False negative.
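As a quick worked example of Eqs. (6)-(8), the snippet below computes the three metrics from a hypothetical set of confusion-matrix counts; the numbers are made up for illustration.

```r
# Hypothetical confusion-matrix counts.
TP <- 90; FP <- 10; TN <- 80; FN <- 20

accuracy  <- (TP + TN) / (TP + TN + FP + FN)   # 0.85
precision <- TP / (TP + FP)                    # 0.90
recall    <- TP / (TP + FN)                    # about 0.82

c(accuracy = accuracy, precision = precision, recall = recall)
```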

Cohen’s Kappa is an evaluation method to determine the reliability or level of agreement between two or more variables. The Cohen’s Kappa evaluation can be written as Eq. (9):

$$k = \frac{p_{0} - p_{e}}{1 - p_{e}}$$
(9)

where k is the kappa coefficient value, \(p_{0}\) is the total main-diagonal proportion of the observation frequencies, and \(p_{e}\) is the total marginal proportion of the observation frequencies. The value of the Cohen’s kappa coefficient can be interpreted in terms of strength of agreement: poor ≤ 0.20; fair = 0.21–0.40; moderate = 0.41–0.60; good = 0.61–0.80; very good = 0.81–1.00.
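For a concrete sense of Eq. (9), the snippet below computes kappa from the same kind of hypothetical two-class counts used above; on these made-up numbers the result is 0.70, i.e. "good" agreement on the scale just given.

```r
# Hypothetical two-class confusion-matrix counts.
TP <- 90; FP <- 10; TN <- 80; FN <- 20
n  <- TP + TN + FP + FN

p0 <- (TP + TN) / n                                           # observed agreement
pe <- ((TP + FP) * (TP + FN) + (FN + TN) * (FP + TN)) / n^2   # agreement expected by chance

kappa <- (p0 - pe) / (1 - pe)
kappa   # 0.70 for these counts
```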

Results and discussion

Dataset descriptions

This experiment uses three datasets publicly available from the UCI machine learning repository. The three datasets are all classification data, with different numbers of instances and features. A description of each dataset can be found in Table 3.

Table 3 Dataset descriptions

Table 3 describes the datasets, all of which are classification data. In this experiment, we use the Bank Marketing dataset, published in 2012 with 45,211 instances and 17 features; the Car Evaluation Database from 1997 with 1728 instances and six features; and the Human Activity Recognition Using Smartphones dataset from 2012 with 10,299 instances and 561 features. The ability to mine intelligence from these data, and from big data more generally, has become highly crucial for economic and scientific gains [106, 107]. Further, feature descriptions and explanations for each dataset can be seen in Tables 4, 5, 6, and 7.

Table 4 Feature description bank marketing dataset
Table 5 Feature description car evaluation dataset
Table 6 Feature description human activity recognition using smartphones dataset (3-axial signal in the X, Y, Z)
Table 7 Feature description human activity recognition using smartphones dataset (variables from the signal)

The set of variables estimated from the 3-axial signals in the X, Y, and Z directions can be seen in Table 6. Additional vectors obtained by averaging the signals in a signal window sample can be seen in Table 7.

Feature selection by RF, Boruta, and RFE for the Bank Marketing dataset is displayed in Figs. 2, 3, 4, and 5. First, in RF, the splitting at each parent node is based on the goodness-of-split criterion, which is based on an impurity function; the splitting rule used is the twoing criterion. The goodness of split is an evaluation of splitting node t by split s. A split s at node t divides the objects into a right node \(t_{R}\), which receives a proportion \(P_{R}\) of the objects, and a left node \(t_{L}\), which receives a proportion \(P_{L}\), where \(i\) denotes the impurity function. The aim is to produce two new nodes whose diversity (impurity) is smaller, that is, more homogeneous, than that of the initial parent node. Splitting node t using split s therefore produces a new classification tree whose tree impurity is smaller than the tree impurity of the previous classification tree.

Fig. 2
figure 2

The importance measure for each variable of the Bank Marketing dataset using Random Forest

Fig. 3
figure 3

The importance measure for each variable of the Bank Marketing dataset using Recursive Feature Elimination

Fig. 4
figure 4

The importance measure for each variable of the Bank Marketing dataset using Boruta

Fig. 5
figure 5

Feature selection and classification method combination for Bank Marketing Dataset a RF + RF, b RF + SVM and c RF + KNN

$$\Phi(s,t) = \Delta i(s,t) = i(t) - P_{R}\, i(t_{R}) - P_{L}\, i(t_{L})$$
(10)

The splitting criterion is based on the greatest value of the goodness of split \(\Phi(s,t)\). Discrete attributes produce only two branches per node, so every possible value at the node must be partitioned into two parts; each combination forms a candidate split, one of the alternatives from which the initial partition at the root node, and at the other nodes, is selected based on the highest goodness-of-split value. Before computing the goodness of split for continuous attributes, a threshold must be found for the attribute. Split points are obtained by taking the average of two consecutive attribute values after the values have been sorted. For a continuous attribute, cases are labelled by whether the attribute value is less than or equal to the threshold value (A ≤ v) or larger than the threshold value (A > v).

Bank marketing datasets

This dataset uses seven predictors and two classes (No and Yes) with 36,170 samples. In Random Forest, resampling uses ten-fold cross-validation, and the best accuracy is at mtry = 2. This means that two randomly chosen variables are examined as split candidates in each tree; another two random variables are drawn for the next tree, and so on, until the specified number of trees has been grown, after which the averaged estimates identify the best and most important variables, justified by kappa (0.3444818).

Figure 2 shows that seven variables are important to use, namely duration, balance, age, poutcomesucess, pdays, campaign, and housingyes. These variables are then used to form the model. Our research uses cross-validation to see the accuracy for each of these variables, which can be seen in Fig. 3, and performs Boruta in Fig. 4.

Moreover, these experiments perform KNN, tested with k = 5, 7, and 9, with ten-fold cross-validation resampling; k = 9 is best, with an accuracy value of 0.8841308 and kappa 0.2814066. The same is done in SVM by comparing the cost C (0.25, 0.50, and 1); the best accuracy is obtained at C = 1 with sigma 0.2547999, reaching an accuracy of 0.8993641 and kappa 0.355709. Finally, we perform LDA with ten-fold cross-validation, which obtains an accuracy of 0.898037 and kappa 0.4058678. These experimental results are fully reported in Tables 8 and 9.
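A hedged sketch of this tuning setup with caret is given below; `bank_train` and the outcome column `y` are placeholder names, and the SVM grid mirrors the cost values described above (caret's default radial-kernel grid with tuneLength = 3 tries C = 0.25, 0.5, and 1 while holding sigma constant).

```r
library(caret)
set.seed(123)

ctrl <- trainControl(method = "cv", number = 10)

# KNN over k = 5, 7, 9.
fit.knn <- train(y ~ ., data = bank_train, method = "knn",
                 tuneGrid = data.frame(k = c(5, 7, 9)),
                 metric = "Accuracy", trControl = ctrl)

# Radial SVM; sigma is held constant and C = 0.25, 0.5, 1 are compared.
fit.svm <- train(y ~ ., data = bank_train, method = "svmRadial",
                 tuneLength = 3, metric = "Accuracy", trControl = ctrl)

# LDA with the same resampling.
fit.lda <- train(y ~ ., data = bank_train, method = "lda",
                 metric = "Accuracy", trControl = ctrl)

summary(resamples(list(KNN = fit.knn, SVM = fit.svm, LDA = fit.lda)))
```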

Table 8 Classification accuracy of different classifiers with bank marketing dataset
Table 9 Statistics by the class of different classifiers with bank marketing dataset

Figure 5 displays the selection of 7 features based on RF + RF, RF + SVM, and RF + KNN. The KNN accuracy increases as larger values of the number of neighbours are used, and in the random selection of predictors the larger number of predictors performs best. Furthermore, in RF + SVM, the best accuracy is obtained with a cost close to 1.

Car dataset

At the simulation stage of the Car dataset in Random Forest, we use 1384 samples, 4 predictors, and 4 classes (acc, good, unacc, vgood). The resampling stage compared mtry values of 2, 7, and 12; the best result is mtry = 7, with an accuracy of 0.9436328 and kappa 0.8784367. Moreover, in modelling with KNN, the optimal model is obtained at k = 5 with an accuracy of 0.7969389 and kappa 0.5683084. Furthermore, the SVM uses ten-fold cross-validation resampling with the tuning parameter sigma held constant at a value of 0.07348688; C = 0.5 reaches an accuracy of 0.8346161 and kappa 0.6319634. Lastly, LDA achieves accuracy = 0.8431124 and kappa = 0.6545901. The results are fully reported in Tables 10 and 11. Feature selection by RF, Boruta, and RFE for the Car Evaluation dataset can be seen in Figs. 6, 7, and 8.

Table 10 Classification accuracy of different classifiers with car evaluation dataset
Table 11 Statistics by class of Different Classifiers with Car Evaluation Dataset (4 features)
Fig. 6
figure 6

The importance measure for each variable of the Car Evaluation dataset using Random Forest

Fig. 7
figure 7

The importance measure for each variable of the Car Evaluation dataset using Recursive Feature Elimination

Fig. 8
figure 8

The importance measure for each variable of the Car Evaluation dataset using Boruta

Figure 9 portrays the selection of 4 features based on RF + RF, RF + SVM, and RF + KNN. In this case, choosing more attributes does not guarantee high accuracy; this is shown by the final value used for the RF + RF model being mtry = 7. In RF + SVM, the tuning parameter sigma was held constant at a value of 0.07348688, and accuracy was used to select the optimal model using the largest value; the final values used for the model were sigma = 0.07348688 and C = 0.5.

Fig. 9
figure 9

Feature selection and classification method combination for Car Evaluation Dataset a RF + RF, b RF + SVM and c RF + KNN

Human activity recognition using Smartphones dataset

In this session, we apply Random Forest, KNN, SVM, and LDA to the HAR dataset with 5884 samples and six classes (Laying, Sitting, Standing, Walking, Walking Downstairs, Walking Upstairs). The best model in Random Forest, selected using the largest accuracy, is mtry = 2 with accuracy = 0.9316768 and kappa = 0.9177446. Feature selection by RF, Boruta, and RFE for the Human Activity Recognition Using Smartphones dataset can be seen in Figs. 10, 11, and 12. Random Forest returns several measures of variable importance; the most reliable measure is based on the decrease in classification accuracy when the values of a variable in a node of a tree are permuted randomly. To choose features, we iteratively fit random forests, at each iteration building a new forest after discarding the variables with the smallest variable importance.

Fig. 10
figure 10

The importance measure for each variable of the Human Activity Recognition Using Smartphones dataset using Random Forest

Fig. 11
figure 11

The importance measure for each variable of the Human Activity Recognition Using Smartphones dataset using Recursive Feature Elimination

Fig. 12
figure 12

The importance measure for each variable of the Human Activity Recognition Using Smartphones dataset using Boruta

Figure 11 illustrates how Random Forest builds a classification tree. The processing is recursive partitioning, meaning the splitting process is repeated for each child node resulting from a previous split, and it continues until no further split is possible. The term partition means that the sample data are broken down into smaller parts, or partitions.

Figure 12 describes the importance measure for each variable of the HAR dataset. Boruta performed 99 iterations in 1.04146 h. In this process, 404 attributes were confirmed important (V1, V10, V100, V101, V103, and 399 more), 58 attributes were confirmed unimportant (V102, V107, V111, V128, V148, and 53 more), and 100 tentative attributes were left (V104, V105, V110, V112, V115, and 95 more). This work employs the varImp(fit.rf) function to generate important features by RF. Next, to select important features by RFE, our experiment uses the RFE function with parameters such as rfeControl(functions = rfFuncs, method = ”cv”, number = 10). Moreover, we use the TentativeRoughFix(boruta_output) function to select significant features by Boruta. Besides, in KNN, we test k = 5, 7, and 9; the final value used for the model was k = 7 with accuracy = 0.9036328 and kappa = 0.8839572. SVM resampling was run across the tuning parameters C = 0.25, 0.50, and 1, with the tuning parameter sigma held constant at a value of 1.194369; accuracy was applied to select the optimal model using the largest value, and the final values used for the model were sigma = 1.194369 and C = 1 with accuracy = 0.8708287 and kappa = 0.8444160. Lastly, LDA with ten-fold cross-validation resampling reached accuracy = 0.8303822 and kappa = 0.7955373. Tables 12 and 13 describe the full experimental results with the Human Activity Recognition Using Smartphones dataset.

Table 12 Classification accuracy of different classifiers with human activity recognition using smartphones dataset
Table 13 Statistics by the class of different classifiers with human activity recognition using smartphones dataset (6 features)

Figure 13 represents the selection of 6 features with RF + RF, RF + SVM, and RF + KNN. As with the Car dataset, the best number of predictors in the HAR dataset is 2, so selecting many predictors does not guarantee high accuracy. The RF + SVM result selects cost = 1, which improves accuracy accordingly. Finally, for RF + KNN, the best number of neighbours appears to be 7.

Fig. 13
figure 13

Feature selection and classification method combination for Human Activity Recognition Using Smartphones Dataset a RF + RF, b RF + SVM and c RF + KNN

Evaluation performance and discussion

The contribution of this simulation paper is to present the different insights from each experimental dataset: the Bank Marketing dataset in Tables 8 and 9, the Car Evaluation dataset in Tables 10 and 11, and the Human Activity Recognition Using Smartphones dataset in Tables 12 and 13. We use 80% of the data for training and 20% for testing in each experiment. To compare accuracy, this work follows metric = ”Accuracy”. At the same time, we compare the accuracy of the different classifier methods by following trainControl(method = ”cv”, number = 10) and different method parameters for the experiment (method = ”lda”, method = ”knn”, method = ”svmRadial”, and method = ”rf”). The determination of the hyperplane function for classification in this study is done by optimizing margins.
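A hedged sketch of this evaluation protocol is shown below; `dat` and its outcome column `Class` are placeholders, and the four caret method strings match those listed above.

```r
library(caret)
set.seed(123)

# 80/20 split and a shared 10-fold cross-validation control.
idx      <- createDataPartition(dat$Class, p = 0.8, list = FALSE)
training <- dat[idx, ]
testing  <- dat[-idx, ]
control  <- trainControl(method = "cv", number = 10)

# Train the four classifiers compared in this study.
model_methods <- c(lda = "lda", knn = "knn", svm = "svmRadial", rf = "rf")
fits <- lapply(model_methods, function(m)
  train(Class ~ ., data = training, method = m,
        metric = "Accuracy", trControl = control))

# Held-out accuracy and kappa for each classifier.
lapply(fits, function(f)
  confusionMatrix(predict(f, testing), testing$Class)$overall[c("Accuracy", "Kappa")])
```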

Additionally, the problem is formulated as Quadratic Programming (QP) by completing an optimization function, which is simplified by transformation into the Lagrange function. This function creates a hyperplane that separates the data according to each class. The calculation finds the value of the Lagrange multiplier (α) and the value of b. Error values are obtained for each classification performance measurement with several pairs of parameter values (C parameters and kernel parameters); these pairs are tried to determine which pair of parameter values is best for the classification in this study, giving the error value for each predetermined pair of cost (C) and kernel parameters. Besides covering a selection procedure, [107] also describes an approach to estimating error rates. Furthermore, [108] investigates the use of random forest for classification of microarray data (including multi-class problems) and proposes a new method of gene selection in classification problems based on random forest. To evaluate the prediction error of all methods we use the bootstrap strategy proposed by Efron and Tibshirani [109]. Their experiments show that a particular bootstrap method substantially outperforms cross-validation in a catalogue of 24 simulation experiments. Besides providing point estimates, it also considers estimating the variability of an error rate estimate [110]. The bootstrap strategy uses a weighted average of the resubstitution error (the error when the classifier is applied to the training data) and the error on samples not used to train the predictor.

Tables 8, 10, and 12 describe the classification accuracy of different classifiers with the different feature selection methods Boruta, RFE, and RF. The results show that the RF method has high accuracy in all experiment groups. According to Table 8, the RF method has a high accuracy of about 90.88% with all features (16 features) and 90.99% accuracy with 7 features. Moreover, in Table 10, the RF method leads with 93.31% accuracy with 6 features and 93.36% accuracy with 4 features. In the next experiment, reported in Table 12, the RF method gained 98.57% accuracy with 561 features and 93.26% accuracy with only 6 features. In general, accuracy tends to decrease as features are limited, but we can still obtain good accuracy if we select the important features with a feature selection method. Random Forest models in data mining are prediction models applied to describe classification and regression, and decision trees are utilized to identify the most likely strategies to achieve a goal. The use of Random Forest is a widespread technique in data mining, in addition to obtaining high accuracy with RF + RF. The advantages of using decision trees as a classification tool include: (1) RF is easy to understand. (2) RF can handle both nominal and continuous attributes. (3) RF represents discrete classification values well enough. (4) RF belongs to the nonparametric methods, so it does not require distribution assumptions.

Lately, the rise of big data has presented some difficulties for the traditional feature selection task, while some unique characteristics of big data also open up new possibilities for feature selection research [111]. The latest advances in feature selection combine feature selection with deep learning, especially Convolutional Neural Networks (CNN), for classification tasks: applications include neurodegenerative disorder classification in bioinformatics using the Principal Component Analysis (PCA) algorithm [112, 113], brain tumour segmentation [114] using three-planar superpixel-based statistical and textural feature extraction, remote sensing imagery classification using a fusion of CNN and RF [115], software fault prediction [116] using enhanced binary moth flame optimization as feature selection, and text classification based on independent feature space search [117].

Conclusions and future work

In this paper, we compare four classifier methods: Random Forest (RF), Support Vector Machines (SVM), K-Nearest Neighbors (KNN), and Linear Discriminant Analysis (LDA). We combine those classifiers with different feature selection methods, RF, RFE, and Boruta, to select the best classifier based on the accuracy of each one. Feature selection is essential for classification data analysis, and this is demonstrated in the experiments. Besides, Tables 8, 10, and 12 demonstrate that the RF method has high accuracy in all experiment groups.

Regarding the performance evaluation in our experiments, Random Forest is undoubtedly the best classifier. Furthermore, in all experiments with the three different datasets, varImp() by RF is the best feature selection method compared to Boruta and RFE. RF methods are extremely useful and efficient in selecting the important features, so we should not use all the features in the dataset: selecting only the important features reduces processing time while still giving the best accuracy, since more features mean higher-dimensional data. Based on our evaluation results, our proposed model performs better than the other methods on each dataset. For instance, in Table 12, the RF method obtains 98.57% accuracy with 561 features and 93.26% accuracy with only 6 features.

In the future, we would like to set up our own dataset or use different data repositories and different methods. Future research can also try QUEST, which stands for Quick, Unbiased, and Efficient Statistical Tree. QUEST is a classification tree method that produces two nodes per split. The variable used to split a node is the variable with the smallest p value; the selected variable is then used to split the data into two nodes. Also, future research can try Gradient Boosting, and the rest of the boosted-algorithm family, to improve the predictive accuracy of the model. The various boosting algorithms, such as XGBoost [45], AdaBoost [118], and Gentle Boost [119, 120], each have their own mathematical formulation and variations. The concept of Gradient Boosting lies in iteratively adding expansions that improve the fit to the criterion.