Abstract

Hepatitis is among the deadliest diseases. Its management and diagnosis are expensive and require a high level of human expertise, which poses challenges for health care systems in underdeveloped and developing countries. Hence, automated methods for accurate prediction of hepatitis disease are needed. In this paper, we develop a diagnostic system that hybridizes a linear support vector machine (SVM) model with an adaptive boosting (AdaBoost) model. We exploit the sparsity in linear SVM that is caused by $L_1$ regularization. The sparse $L_1$-regularized SVM is capable of eliminating redundant or irrelevant features from the feature space. After the features are filtered through the sparse linear SVM, the output of the SVM is applied to the AdaBoost ensemble model, which is used for classification. Two types of numerical experiments are performed on the clinical features of hepatitis disease collected from the UCI machine learning repository. In the first experiment, only the conventional AdaBoost model is used, while in the second experiment, the feature vector is applied to the sparse linear SVM before its application to the AdaBoost model. Simulation results demonstrate that the proposed method improves the accuracy of the conventional AdaBoost model by 6.39 percentage points (from 82.97% to 89.36%) and also reduces its time complexity. In addition, the proposed method shows better performance than many previously developed methods for hepatitis disease prediction.

1. Introduction

Hepatitis is considered a major chronic liver disease worldwide. The liver is the heaviest and one of the largest organs of the human body [1], and it is responsible for several key functions, including bile secretion, protein formation, and elimination of toxins from the body. Hence, inflammation of the liver (caused by hepatitis) results in liver dysfunction, and consequently, the health of the subject deteriorates. The symptoms of hepatitis differ from patient to patient, with some subjects showing no signs. Well-known symptoms include yellowish eyes and skin, abdominal pain, poor appetite, and tiredness [2, 3]. Hepatitis can be acute or chronic depending on its duration. If it lasts for less than six months, it is acute; if it lasts for more than six months, it is chronic [4]. It has been reported that hepatitis results in more than a million deaths each year. Diagnosis of hepatitis through conventional methods is difficult and requires expensive medical tests [5]. In contrast, diagnosis through an intelligent system reduces the cost and shortens the examination time. Hence, the development of intelligent diagnostic systems for the prediction of such diseases is very important.

In the past, numerous hybrid models for disease detection have been developed by different researchers. These include automated systems for Parkinson’s disease prediction [6–8], mortality prediction [9, 10], cancer detection [11, 12], and heart disease [6, 13, 14]. These models are developed by hybridizing data mining models (for feature preprocessing) such as principal component analysis (PCA) and Fisher discriminant analysis (FDA) with machine learning models such as decision trees, logistic regression, support vector machine (SVM), Naive Bayes, neural network models, ensembles of neural networks, K-nearest neighbors, deep neural networks, and optimized and stacked SVMs [15–24]. For example, Adamczak developed different automated models for hepatitis prediction. These models include MLP + BP, RBF (Tooldiag), and FSM without rotation, which achieved prediction accuracies of 77.4%, 79%, and 88.5%, respectively [25]. In another study conducted by Passi, MLO was developed for hepatitis, which resulted in a hepatitis prediction accuracy of 79.70% [26, 27]. Stern and Dobnikar developed AIS, LDA, and FDA models, which achieved hepatitis prediction accuracies of 82%, 84.5%, and 86.40%, respectively [27]. Nilashi et al. developed KNN, ANFIS, NN, and SVM models and achieved hepatitis prediction accuracies of 71.41%, 79.67%, 78.31%, and 81.17%, respectively [28]. Polat and Gunes discussed the hybridization of feature extraction through the principal component analysis model with classification through an artificial immune recognition system for the prediction of hepatitis disease [1, 29].

In this paper, we develop a hybrid intelligent diagnostic system. To improve the strength of the AdaBoost predictive model, we propose to use an $L_1$-penalized linear SVM. The $L_1$ penalty makes the linear SVM sparse, so that it eliminates redundant features by driving their coefficients to zero. After elimination of redundant features through the sparse linear SVM, the remaining features are supplied to the AdaBoost model for classification. In order to analyze the impact of the sparse linear SVM on the AdaBoost model, we performed two types of numerical experiments. In the first experiment, we developed the conventional AdaBoost model, while in the second experiment, we constructed a learning system by stacking the sparse SVM with the AdaBoost model. The performance of both models was evaluated using a publicly available hepatitis disease dataset. Experimental results demonstrated that the sparse linear SVM enhances the accuracy of conventional AdaBoost (for hepatitis disease prediction based on the collected clinical features). Additionally, the sparse linear SVM reduces the complexity of the AdaBoost model, as the optimal subset contains fewer features than the full feature set.

The rest of the manuscript is organized as follows. Datasets, the proposed sparse linear SVM, and AdaBoost-based learning system are elaborated in Section 2. Section 3 discusses various schemes for validation as well as multiple metrics for evaluation used in the manuscript. Section 4 discusses experimental setup and obtained results, whereas the last section concludes the paper.

2. Materials and Methods

2.1. Dataset Description

The hepatitis dataset consists of 155 samples, and each sample contains 19 features. Details about the 19 commonly used features for the hepatitis dataset are given in Table 1. The label of the dataset is binary, i.e., it can have a value of 1 or 2, where 1 means the sample belongs to a patient who died, while 2 means the sample is that of a subject who survived. There are 32 samples with label 1 and 123 samples with label 2; in the remainder of this paper, these are referred to as the patient class (32 samples) and the healthy class (123 samples), respectively. In machine learning, the data are split into two parts, namely, training and testing. The training part is used to train the model, and its performance is checked by testing the trained model on the testing data. In this study, the dataset is divided into training and testing datasets using 70–30 data partitioning. Hence, out of the 155 samples, 108 samples are used for training, and the remaining 47 samples are used for testing. Out of the 108 training samples, 23 samples belong to the patient class, and 85 samples belong to the healthy class. Consequently, out of the 47 testing samples, 9 samples belong to the patient class, and 38 samples belong to the healthy class. The low representation of the patient class is a limitation of the dataset.
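The class counts above can be cross-checked with a few lines of arithmetic. The sketch below uses only the totals and training counts stated in this section; the variable names are illustrative, and rounding the testing part up is an assumption made to reproduce the 108/47 split.

```python
import math

# Totals stated in the dataset description.
n_total = 155
n_patient_total, n_healthy_total = 32, 123

# 70-30 partitioning; the testing part is rounded up (assumption).
n_test = math.ceil(0.30 * n_total)
n_train = n_total - n_test

# Training-class counts stated in the text.
n_patient_train, n_healthy_train = 23, 85

# Testing-class counts follow by subtraction from the class totals.
n_patient_test = n_patient_total - n_patient_train
n_healthy_test = n_healthy_total - n_healthy_train

print(n_train, n_test)                 # 108 47
print(n_patient_test, n_healthy_test)  # 9 38
```

The subtraction gives 9 patient and 38 healthy testing samples, which sum to the 47 testing samples.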

2.2. Proposed Method

As discussed above, in this paper, we exploit the sparsity in linear SVM to improve the strength of machine learning models, namely, k-nearest neighbours (KNN), Gaussian Naive Bayes (GNB), linear discriminant analysis (LDA), and the AdaBoost ensemble model. Initially, an $L_1$-penalized linear SVM is used to generate sparse features, i.e., to process the full set of features, null the redundant features, and yield a subset containing relevant features only. The subset of features generated by the sparse linear SVM is supplied to the machine learning models for classification. The sparsity of the linear SVM is controlled by its hyperparameter $C$. Hence, for distinct values of $C$, different features will be nullified, resulting in different subsets of features. Thus, to achieve better hepatitis prediction accuracy, it is necessary to develop a sparse linear SVM that nullifies the most redundant or irrelevant features and generates a subset of the most relevant features. This can be accomplished by tuning the hyperparameter $C$. To better understand the functioning of the proposed learning system, we briefly discuss the $L_1$-penalized linear SVM model and its formulation below.

Support vector machines (SVMs) are considered powerful learning methods and have been widely used in different biomedical- and health informatics-related problems [30]. During the training process, an SVM tries to construct an optimal hyperplane that best separates the data points of the two classes (in the case of binary classification) [31]. The major reasons that motivate machine learning researchers to use SVMs are that they generalize well to unseen data and depend on a very small number of hyperparameters [32].

Consider a dataset $S = \{(x_i, y_i)\}_{i=1}^{k}$ with $k$ instances, where $x_i \in \mathbb{R}^{n}$ stands for the $i$th instance, $n$ represents the dimension of the original feature space of hepatitis data, and $y_i \in \{-1, +1\}$ denotes the class labels, i.e., presence or absence of hepatitis disease. The value of $n$ is 19 for the hepatitis dataset considered in this paper. The SVM model determines a hyperplane given by $f(x) = w^{T}x + b$, where $b$ represents the bias and $w$ denotes the weight vector. Based on the training data, the SVM hyperplane maximizes the margin while minimizing the classification error [33]. The margin is the sum of the distances from the hyperplane to the closest negative and the closest positive instances. In other words, the hyperplane maximizes the margin distance $2/\|w\|_{2}$.

SVM uses a set of slack variables $\xi_i$, $i = 1, \ldots, k$, and a penalty parameter $C$, and attempts to maximize the margin and minimize the misclassification errors [34]. This is formulated as follows:

$$\min_{w,\, b,\, \xi} \; \frac{1}{2}\|w\|_{2}^{2} + C \sum_{i=1}^{k} \xi_i \quad \text{subject to} \quad y_i \left(w^{T} x_i + b\right) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1, \ldots, k,$$

where $\xi_i$ is the slack variable that measures the degree of misclassification and the Euclidean norm ($L_2$-norm) is the penalty term. A variant of SVM was introduced by Bradley and Mangasarian, which replaces the $L_2$-norm with the $L_1$-penalty function [35]. The $L_1$-penalized SVM produces sparse solutions and has the feature selection property, because it automatically suppresses irrelevant or noisy features. The formulation of the $L_1$-penalized SVM is given as follows:

$$\min_{w,\, b,\, \xi} \; \|w\|_{1} + C \sum_{i=1}^{k} \xi_i \quad \text{subject to} \quad y_i \left(w^{T} x_i + b\right) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1, \ldots, k.$$

From the above formulas, it can be seen that, for different settings of the SVM hyperparameter $C$, different features will be nulled; consequently, a different subset of features will be produced [36]. The goal is to tune the value of $C$ so as to produce the subset of features that shows the best hepatitis disease prediction accuracy. This is done by exhaustive search. Once a subset of features is produced, it is supplied to the AdaBoost model, which performs the classification task.
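The dependence of the selected subset on the SVM penalty hyperparameter can be illustrated with a short scikit-learn sketch. It is only an illustration under stated assumptions: the data are synthetic (generated with `make_classification`, not the UCI hepatitis data), the grid of values is hypothetical, and `LinearSVC` with `penalty="l1"` uses the squared hinge loss, so it is a close relative of the formulation above rather than an exact implementation of it.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for the hepatitis data (155 samples, 19 features).
X, y = make_classification(n_samples=155, n_features=19, n_informative=5,
                           n_redundant=4, weights=[0.2, 0.8], random_state=0)
X = StandardScaler().fit_transform(X)


def selected_features(X, y, C):
    """Indices of features with nonzero weight in an L1-penalized linear SVM."""
    svm = LinearSVC(penalty="l1", dual=False, C=C, max_iter=10000,
                    random_state=0)
    svm.fit(X, y)
    return np.flatnonzero(svm.coef_.ravel() != 0.0)


# Hypothetical grid: smaller C means stronger regularization.
for C in (0.005, 0.01, 0.05, 0.1, 1.0):
    kept = selected_features(X, y, C)
    print(f"C={C}: {kept.size} of {X.shape[1]} features kept")
```

Small values of the hyperparameter typically leave only a few nonzero coefficients, while larger values retain most of the features.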

AdaBoost (also known as the adaptive boosting classifier) is an ensemble learning model. It uses the boosting approach to construct a metaclassifier by combining the strengths of base classifiers, i.e., weak estimators. The boosting operation converts the weak estimators into a stronger, boosted model. During boosting, a weighted sum of the base estimators is evaluated to produce the final output of the boosted model. This is reflected in the following formulation:

$$F(x) = \operatorname{sign}\left(\sum_{m=1}^{E} \alpha_m f_m(x)\right),$$

where $f_m$ denotes the $m$th base classifier and $\alpha_m$ denotes the weight of the $m$th classifier or estimator. To implement the AdaBoost model, we used the scikit-learn Python API [37]. In the following discussion, $E$ denotes the total number of classifiers or estimators used for constructing the final AdaBoost model.
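The weighted vote of the base estimators can be made concrete with a toy example. The three decision stumps and their weights below are hypothetical and chosen by hand; in AdaBoost they would be learned during boosting.

```python
def boosted_predict(x, base_classifiers, weights):
    """Weighted vote: sign of the sum of weight * prediction, predictions in {-1, +1}."""
    score = sum(alpha * f(x) for f, alpha in zip(base_classifiers, weights))
    return 1 if score >= 0 else -1


# Three hypothetical decision stumps on a two-feature input.
stumps = [
    lambda x: 1 if x[0] > 0.5 else -1,
    lambda x: 1 if x[1] > 1.0 else -1,
    lambda x: 1 if x[0] + x[1] > 2.0 else -1,
]
alphas = [0.9, 0.4, 0.3]  # hand-picked weights, not learned

# The first stump votes +1 and outweighs the other two (0.9 > 0.4 + 0.3).
print(boosted_predict((0.7, 0.2), stumps, alphas))  # 1
# All three stumps vote -1.
print(boosted_predict((0.2, 0.2), stumps, alphas))  # -1
```

The first input shows how a single estimator with a large weight can override two estimators with smaller weights.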

The primary objective of this paper is to investigate and exploit the sparsity in the linear $L_1$-regularized SVM to further improve the strength of the AdaBoost model. To meet this objective, we develop a cascade of the linear sparse SVM and the AdaBoost model. The full feature set is supplied to the SVM, which produces different subsets of features depending on the value of its hyperparameter $C$. The performance of each subset is evaluated by applying it to the AdaBoost model. Thus, we first need to discretize the hyperparameter $C$ and then search for the value of $C$ that produces the subset of features with the best classification performance. The whole process of the proposed method is shown in Figure 1. From the figure, it can be seen that initially, a subset of features is generated using a specific value of $C$. The subset is given to the AdaBoost model, which is trained for each candidate value of $E$, and the performance of the subset is recorded under the optimal $E$. Next, another subset of features is generated using another discrete value of $C$, and again the AdaBoost model is trained and evaluated under the optimal value of $E$. The process is repeated until all the subsets of features have been evaluated. At the end, the optimal subset of features is selected based on performance.
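The search procedure described above can be sketched with scikit-learn as follows. This is a minimal illustration, not the implementation used in the paper: the data are synthetic, the grids for the SVM penalty hyperparameter and for the number of estimators E are hypothetical, and `LinearSVC` with `penalty="l1"` uses the squared hinge loss.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for the hepatitis data (not the UCI dataset).
X, y = make_classification(n_samples=155, n_features=19, n_informative=5,
                           n_redundant=4, weights=[0.2, 0.8], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30,
                                          random_state=0)
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

C_grid = [0.01, 0.05, 0.1, 0.5, 1.0]  # hypothetical discretization of C
E_grid = [10, 50, 100]                # hypothetical grid for E

results = []  # tuples of (accuracy, number of features, C, E)
for C in C_grid:
    # Stage 1: the sparse linear SVM nulls some feature weights.
    svm = LinearSVC(penalty="l1", dual=False, C=C, max_iter=10000,
                    random_state=0)
    svm.fit(X_tr, y_tr)
    kept = np.flatnonzero(svm.coef_.ravel() != 0.0)
    if kept.size == 0:
        continue  # every feature was nulled; nothing to classify on
    # Stage 2: AdaBoost is trained on the surviving features for each E.
    for E in E_grid:
        clf = AdaBoostClassifier(n_estimators=E, random_state=0)
        clf.fit(X_tr[:, kept], y_tr)
        acc = clf.score(X_te[:, kept], y_te)
        results.append((acc, int(kept.size), C, E))

best_acc, best_n, best_C, best_E = max(results)
print(f"best accuracy={best_acc:.4f} with {best_n} features "
      f"(C={best_C}, E={best_E})")
```

The sketch scores each configuration on the held-out testing part, following the description above. Selecting the configuration on a separate validation split or by cross-validation would give a less optimistic estimate.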

3. Evaluation of the Proposed Method

In the literature, different researchers have used various metrics for performance evaluation of their proposed methods. For a more realistic evaluation of the performance of our proposed method, we used the following four evaluation metrics: accuracy (ACC), specificity (Spec.), sensitivity (Sen.), and Matthews correlation coefficient (MCC). Accuracy is the percentage of correctly classified subjects (whether healthy or patients). Specificity is the percentage of healthy subjects that are classified correctly. Similarly, sensitivity is the percentage of patients that are classified correctly. MCC is used to measure the quality of binary classification. The basic formulas for these metrics are given as follows:

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{Sen.} = \frac{TP}{TP + FN}, \qquad \mathrm{Spec.} = \frac{TN}{TN + FP},$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},$$

where $TP$, $TN$, $FP$, and $FN$ denote the numbers of true positives, true negatives, false positives, and false negatives, respectively, with the patient class taken as the positive class.
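The four metrics can be computed directly from confusion-matrix counts, as in the sketch below. The counts are not taken from the paper's figures; they are inferred, as an assumption, to be consistent with the percentages reported in Section 4 for a testing set of 9 patient and 38 healthy samples.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Return (accuracy, sensitivity, specificity, MCC) from confusion counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sen = tp / (tp + fn)
    spec = tn / (tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, sen, spec, mcc


# Inferred counts (assumption): 1 of 9 patients detected, no false positives.
baseline = binary_metrics(tp=1, tn=38, fp=0, fn=8)
# Inferred counts (assumption): 4 of 9 patients detected, no false positives.
proposed = binary_metrics(tp=4, tn=38, fp=0, fn=5)

for name, (acc, sen, spec, mcc) in (("AdaBoost", baseline),
                                    ("L1 SVM + AdaBoost", proposed)):
    print(f"{name}: ACC={acc:.4f} Sen={sen:.4f} Spec={spec:.4f} MCC={mcc:.4f}")
```

With these counts, the baseline accuracy is 39/47 and the proposed accuracy is 42/47, which agree with the reported 82.97% and 89.36%.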

4. Results and Discussion

In this section, the experimental setting and the obtained results are analyzed and discussed. All the experiments (including conventional machine learning-based experiments and the proposed method-based experiments) are performed using Python software (scikit-learn). The experiments were simulated using Intel Core i5 processor with 8 GB RAM and 64-bit operating systems. For the purpose of comparison, we performed two types of experiments. First, the conventional AdaBoost model is developed for the prediction of hepatitis disease. Second, the proposed hybrid model is developed to predict hepatitis disease based on the filtered set of features.

4.1. Simulation of Conventional AdaBoost Model on Hepatitis Data

In this experiment, we develop the conventional AdaBoost model for the hepatitis disease data. The model is trained using 70% of the dataset and tested on the remaining 30% of the data. An exhaustive grid search algorithm is used to find the optimized version of the AdaBoost model. The results for both optimal and nonoptimal hyperparameters are given in Table 2. It is evident from the table that the best performance of 82.97% accuracy, 11.11% sensitivity, 100% specificity, and an MCC of 0.302 is obtained with the optimal hyperparameter setting.

4.2. Simulation of the Proposed Method Using the Sparse Linear SVM and AdaBoost Model on Hepatitis Data

In this experiment, the proposed learning system is developed by using both models, i.e., the sparse linear SVM and the AdaBoost model. The simulation results are reported in Table 3. As can be seen in the table, different values of $C$ for the sparse SVM generate subsets of features with different sizes. For subsets with sizes N = 1–10, no improvement in performance is observed. For larger subsets, however, the performance of the system changes. It is evident from the table that the best accuracy of 89.36% is obtained with a subset containing only 16 features. In contrast, the best accuracy on the full feature set, i.e., with conventional AdaBoost, is 82.97%, which is shown in the last row of the table. Hence, coupling the conventional AdaBoost model with the sparse linear SVM model improves the accuracy by 6.39 percentage points.

To analyze the results on the testing data, we use the confusion matrix. As discussed above, the dataset is divided into training and testing datasets using 70–30 data partitioning, so that 108 of the 155 samples are used for training and the remaining 47 samples are used for testing. Out of the 108 training samples, 23 samples belong to the patient class, and 85 samples belong to the healthy class. Out of the 47 testing samples, 9 samples belong to the patient class, and 38 samples belong to the healthy class. The predictions of the proposed L1 SVM-AdaBoost model are summarized in the confusion matrix in Figure 2.

To further show that coupling the sparse linear SVM with the conventional AdaBoost model enhances its performance, we use the area under the receiver operating characteristic (ROC) curve (AUC). The AUC of the conventional AdaBoost model is 0.587, while the AUC of the proposed method is 0.649. Hence, the ROC analysis further supports the finding that coupling with the sparse linear SVM enhances the performance of AdaBoost for hepatitis disease data.

4.3. Comparison of the Proposed Method with Some Other Proposed Methods Applied to Hepatitis Data

The above discussion shows that the learning system proposed in this paper improves the strength of the conventional AdaBoost model. In this section, the effectiveness of the developed learning system is further validated by comparing its performance with some of the well-known models presented in previous studies. The prediction accuracies and brief details about the models are given in Table 4. It is evident that our proposed method outperforms 23 other machine learning models.

By analyzing Table 4, it can be seen that previous studies have exploited various machine learning-based methods to improve the hepatitis disease prediction accuracy. For example, Stern and Dobnikar developed methods based on discriminant analysis (including linear discriminant analysis and quadratic discriminant analysis) and achieved a classification accuracy of 85.8% with quadratic discriminant analysis. Similarly, Ozyildirim and Yildirim developed a number of models in search of an optimum model with better classification accuracy. They obtained the highest classification accuracy of 83.75% using a radial basis function (RBF). Moreover, the results tabulated in Table 4 show that the previous studies evaluated their proposed methods by considering classification accuracy only. In this paper, we analyzed the results of the proposed hybrid method with a number of metrics and demonstrated its improvement on two key metrics, i.e., classification accuracy and area under the curve (AUC).

4.4. Limitations of the Study

Although this paper demonstrated the effectiveness of exploiting sparsity in the feature space to improve the performance of machine learning models, its main limitation is the low sensitivity rate. This is due to the low representation of the patient class in the dataset, i.e., the imbalanced nature of the hepatitis disease dataset. Out of 155 samples, 123 samples belong to the healthy class, and only 32 samples belong to the patient class. Recent research pointed out that machine learning models trained on such imbalanced classes are biased against the minority class, i.e., they show very poor performance on the minority class [40]. Conversely, the models are biased towards the majority class, i.e., they show very good performance on the majority class. In the case of the hepatitis disease dataset, the minority class is the patient class, and the majority class is the healthy class. From the results, it can be seen that the majority class has 100% detection accuracy (i.e., 100% specificity), while the minority class has poor detection accuracy (i.e., 44% sensitivity). In future studies, balanced datasets need to be collected, i.e., datasets having the same representation for both classes. Machine learning models trained under such a balanced scenario are expected to show better sensitivity. Moreover, the exhaustive search method for hyperparameter optimization is time-consuming. In future, the application of metaheuristic algorithms [41, 42] should be explored.

5. Conclusion and Future Work

This work developed an automatic hepatitis disease detection system using machine learning methods. The AdaBoost model was developed for hepatitis disease prediction. To improve the classification strength of the AdaBoost model, sparsity in the linear SVM model was exploited. The SVM model eliminated redundant or irrelevant features and thus improved the prediction accuracy of the AdaBoost model. It was also shown that the proposed sparse linear SVM helps decrease the time complexity of the AdaBoost model. Moreover, as evidenced by the simulation results, our proposed method surpassed many previously published methods in terms of hepatitis disease prediction accuracy. These results suggest that the proposed methodology can also be exploited to improve the performance of other machine learning models and can thus help in making quality decisions in other disease detection problems as well.

As discussed above, although the proposed method can be used as a tool to improve the performance of machine learning models, the obtained accuracy still needs considerable improvement. Thus, in future studies, more robust cascaded models should be developed by using deep learning approaches for classification. Additionally, the low sensitivity caused by the low representation of the patient class in the dataset is a limitation of the study that remains an open challenge for future work. In future studies, extended hepatitis disease datasets with a balanced class distribution should be collected.

Data Availability

All the data used in this study are available at the UCI Machine Learning Repository.

Conflicts of Interest

The authors declare that they have no conflicts of interest.