Abstract

Imbalanced class distribution in medical datasets is a challenging problem that hinders correct disease classification. It emerges when the number of healthy class instances is much larger than the number of disease class instances. To address this problem, we propose undersampling the healthy class instances to improve classification of the disease class. The proposed model is named Hellinger Distance Undersampling (HDUS). It employs the Hellinger distance to measure the resemblance between each majority class instance and its neighbouring minority class instances, separating the classes effectively and boosting the discrimination power of each class. An extensive experiment was conducted on four imbalanced medical datasets using three classifiers to compare HDUS with a baseline model and three state-of-the-art undersampling models. The outcomes show that HDUS outperforms the other models in terms of sensitivity, F1 measure, and balanced accuracy.

1. Introduction

Classification is a standard data mining process. It consists of two steps: building a model and testing it. A classification model is built by learning from training data and is then tested to predict the category of unknown samples. Most classification algorithms were built mainly to classify balanced datasets; a problem occurs when a dataset is imbalanced, which degrades the recognition power of the classifier [1]. The imbalance problem appears when the data comprise very different sample numbers for the various classes, i.e., the number of samples of one class is greater than that of the second class; the former is called the majority class and the latter the minority class [2]. Imbalanced datasets usually influence the classification process. If the imbalanced class distribution is not addressed before the classification procedure, the classifier tends to be biased towards the majority class cases while failing to classify the minority class cases correctly [3]. The problem of classifying imbalanced data often occurs in real-life applications such as the analysis of medical datasets, where the cases of patients with a disease are significantly fewer than those without the disease. For instance, in cancer detection, the number of patients diagnosed with cancer is much smaller than the number of patients who do not have cancer [4]. A classification model built to predict cancer therefore yields low classification performance on the abnormal class and incorrect disease predictions, which leads to serious health risks.

In general, the problem of classifying imbalanced data stems from training on too few minority class samples, which are inadequate for accurate prediction [5]. Previous studies have proposed resampling techniques to solve the class imbalance problem. These techniques are mainly categorized into oversampling and undersampling [6]. Oversampling methods generate samples for the minority class [7], whereas undersampling methods reduce samples of the majority class [8].

In this work, we propose a novel undersampling technique, named the Hellinger Distance Undersampling (HDUS) model, aimed at solving the imbalanced classification problem in medical datasets. The proposed model reduces the healthy class samples to improve the classification performance on the rare disease class. It adopts the Hellinger distance to measure the similarity between each majority class instance and its neighbouring minority class instances, then takes a fixed number of the highest Hellinger distance values and sums them to obtain a similarity value for each majority instance. Finally, the model selects the subset of majority instances with the top similarity values, which is combined with the original minority class instances. This model can effectively separate majority and minority class instances and boost the discrimination power of each class, thereby improving the classification accuracy for the rare class. We compared HDUS with four models: a baseline model without any sampling technique and three state-of-the-art undersampling models. The experiment was conducted on four imbalanced medical datasets using three classifiers.

The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 presents the proposed model. Section 4 describes the experiment setup. Section 5 presents the results of the experiment. Section 6 discusses the results. Section 7 concludes the study.

2. Related Work

Recently, the problem of imbalanced classification has drawn much attention in the literature because traditional classification algorithms were not originally built to train on imbalanced datasets [9]. This problem usually emerges from the different distribution of classes in the feature space. Furthermore, imbalanced datasets have other problematic characteristics such as overlapping samples, small disjuncts, and small sample sizes. Overlapping refers to data samples of different classes that overlap in the feature space. Small disjuncts refer to the few samples of the minority class that are spread separately in the feature space. Finally, small sample size refers to an insufficient number of data samples in the minority class. These characteristics raise the complexity of the classifier, which in turn makes it difficult to classify the minority class samples correctly [5, 10].

To overcome the imbalanced data problem, current approaches can be categorized into the algorithm level and the data level. The first group modifies the classification algorithm to support the minority class cases, either by assigning weights to cases from different classes or by using ensemble methods [11, 12]. The second group is applied before the classification procedure to modify the distribution of the imbalanced dataset through data sampling techniques [13].

Previous studies indicated that solving the imbalance problem at the data level is simple and efficient for imbalanced classification [1]. Therefore, data sampling techniques have been widely used to alleviate the imbalanced classification problem by modifying the class distribution of the training dataset. Generally, sampling techniques are categorized into oversampling and undersampling [14]. Oversampling aims to generate artificial instances for the minority class by adding copies of already existing minority class instances [7]. Many oversampling methods have been applied previously. Random oversampling (ROS) is a common approach that randomly adds samples to the minority class. Although ROS adjusts the class distribution, it may increase the overfitting problem by making identical copies of minority samples, which influences the classification process [14]. Another standard oversampling approach is the synthetic minority oversampling technique (SMOTE) [15], which generates artificial samples. Unlike ROS, SMOTE avoids the overfitting problem, but it may cause overlapping with the surrounding samples, increase the overall training data size, and hinder the training process [16, 17]. In general, oversampling dilutes the class imbalance problem, but the training data becomes more crowded, which affects classification performance [18].

Undersampling is another reasonable data sampling technique, which attempts to reduce the number of samples in the majority class. The core question in undersampling is how to eliminate majority class instances in a manner that retains the practical distinction among classes [8]. Numerous undersampling methods have been implemented previously. The most naive approach is random undersampling (RUS), which eliminates instances from the majority class randomly. It tends to balance the class distribution but wastes valuable information that could be essential for the classification process [14]. Tomek link (Tml) is another undersampling method used to address the overlapping problem. It looks for pairs of samples that belong to different classes but are each other's nearest neighbours and eliminates the majority sample of each pair [19]. Another method is the edited nearest neighbour, which eliminates majority class samples based on their nearest neighbours belonging to the minority class. When most of a majority class sample's neighbours belong to the minority class, the sample is removed as noise or borderline [20].

Previous research has revealed that there is no optimal rule for choosing between over- and undersampling. However, studies have shown that undersampling the majority class usually outperforms oversampling the minority class [15]. Moreover, as data sizes keep increasing, undersampling becomes a better option than oversampling [21]. Instance selection has been used in previous studies to remove outliers from the training dataset, which can make the classifier perform better than on the original dataset [22–24]. However, existing instance selection techniques are designed to choose a portion of the whole dataset and cannot be used directly to choose instances from just one class, such as the majority class instances. Kubat and Matwin [25] proposed one-sided selection to remove noisy, redundant, and borderline samples from the majority class while keeping the original samples of the minority class.

Recently, many undersampling methods have been reported in the literature to improve imbalanced data classification. Tsai et al. [10] introduced an undersampling method that clusters the majority class into groups of similar data samples; instance selection then extracts the nonrepresentative data samples from each group. Nwe and Lynn [20] suggested an undersampling approach that begins by determining the closest majority class neighbours of each minority class sample and then counting how often each majority class neighbour is associated with the minority class samples; the required number of majority class instances is finally taken according to these counts. In addition, the authors in [26] adopted the one-sided undersampling technique. They proposed a method for reducing the majority class size that modifies the original imbalanced class distribution by measuring the similarity of each majority class case to the corresponding minority class cases. The method effectively separates the majority and minority class cases to optimize the identity value of each class.

3. Proposed Model

This work aims to provide a method that handles the problem of imbalanced data distribution, which degrades the classification performance on minority class samples. In an imbalanced dataset, the class with the larger number of instances takes up most of the feature space. The unequal class distribution leaves the classifier inadequately trained to classify the smaller class, and the larger class overwhelms the identification ability of the smaller class. In this case, the classifier favours the majority class instances and scores a falsely high accuracy.

In this work, we propose an undersampling model that follows the principle of one-sided selection to extract instances from the majority class while the minority class data remain unchanged. This is based on the premise that it is better to keep the minority class instances exactly as they are, with nothing added or removed, so that the classifier gains accurate recognition power for the original minority class samples.

Instance selection in an undersampling technique depends on how majority class instances are selected in a manner that retains the distinction among classes. In our proposed model, we use the Hellinger distance (HD) [27, 28] to choose instances from the majority class based on their Hellinger similarity degree with the minority class instances. The Hellinger distance is a measure of the divergence between two probability distributions [29]. In [30], Cieslak et al. demonstrated analytically that HD is very robust in the presence of a skewed class distribution and is not affected by the class imbalance rate due to its isometric contours. This motivated the use of HD in our proposed model. Formally, let $P = (p_1, \ldots, p_d)$ and $Q = (q_1, \ldots, q_d)$ be two discrete probability distributions; then the HD between $P$ and $Q$ can be expressed as follows:

$$HD(P, Q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^{d} \left( \sqrt{p_i} - \sqrt{q_i} \right)^{2}}. \tag{1}$$
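For concreteness, the short sketch below computes this quantity in Python. It is only an illustration: it assumes the two inputs are non-negative vectors normalized to sum to one (so that they can be treated as discrete distributions), and the function name and the 1/√2 factor bounding the distance to [0, 1] are conventional choices rather than details taken from the paper.

```python
import numpy as np

def hellinger_distance(p, q):
    """Hellinger distance between two discrete probability vectors p and q."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) / np.sqrt(2)

# Two feature vectors rescaled to behave like distributions
p = np.array([0.2, 0.5, 0.3])
q = np.array([0.1, 0.4, 0.5])
print(hellinger_distance(p, q))  # a value in [0, 1]; 0 means identical distributions
```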

Considering the problem of classifying imbalanced datasets and motivated by the properties of the HD, we propose an undersampling model that uses the Hellinger similarity measure. The proposed model reduces the number of majority class instances, aiming to improve the prediction performance on the minority class, which is the class of highest interest in medical datasets. Algorithm 1 presents the pseudocode of the proposed HDUS model.

Input: imbalanced training dataset (ITrD)
Output: balanced training dataset (BTrD)
Group ITrD according to the classes:
    C1 = ITrD(class1)    // C1 is the minority class, which contains fewer instances
    C2 = ITrD(class2)    // C2 is the majority class, which contains more instances
For each instance i in C2:
    For each instance j in C1:
        Sim(i, j) = similarity between C2(i) and C1(j) computed with the Hellinger distance
        Append Sim(i, j) to HD(i)
    Next j
    Select the m top values from HD(i)    // m is a given number of neighbouring minority class instances
    HDsum(i) = sum of the selected m top values
Next i
C2HD = select the w majority class instances with the highest similarity values in HDsum    // w is a given number
Return BTrD = C2HD + C1
Algorithm 1: Hellinger Distance Undersampling (HDUS) pseudocode
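To make the procedure concrete, the following is a minimal Python sketch of Algorithm 1. It is an illustration rather than the authors' implementation: it assumes the majority and minority instances are supplied as numeric arrays whose rows are non-negative and normalized so that the Hellinger distance applies, and the parameters m and w correspond to the quantities of the same names in the pseudocode.

```python
import numpy as np

def hellinger_distance(p, q):
    # Distance between two non-negative, normalized feature vectors
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) / np.sqrt(2)

def hdus(X_maj, X_min, m, w):
    """Return a rebalanced training set: the w majority instances whose summed
    top-m Hellinger values against the minority class are highest, plus all
    minority instances (labels here: 0 = majority, 1 = minority)."""
    X_maj, X_min = np.asarray(X_maj, float), np.asarray(X_min, float)
    scores = []
    for xi in X_maj:                                  # outer loop over C2
        dists = np.array([hellinger_distance(xi, xj) for xj in X_min])
        scores.append(np.sort(dists)[-m:].sum())      # sum of the m top values, as in Algorithm 1
    top_w = np.argsort(scores)[-w:]                   # w highest-scoring majority rows
    X_bal = np.vstack([X_maj[top_w], X_min])
    y_bal = np.hstack([np.zeros(w), np.ones(len(X_min))])
    return X_bal, y_bal
```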

4. Experiment Setup

In this section, we present the details of the experiment used to test the proposed HDUS model: the datasets, the classification algorithms, the evaluation metrics, and the undersampling methods used for comparison.

The whole experiment was implemented in the Python programming language using the Spyder IDE, with the available libraries providing the necessary preprocessing, classification, and evaluation functions.

4.1. Datasets

In this work, we used four imbalanced medical datasets to evaluate the performance of the proposed HDUS model. For each dataset, the number of features (attributes), the number of instances, the number of majority cases, and the number of minority cases are presented. These datasets are described in the following.

4.1.1. A Novel Colorectal Cancer Dataset (CRC)

This dataset is from Southampton University Hospital and has been used with approval from the responsible surgeon (co-author); the data are fully anonymized. The data are for patients with primary cancer at one of 12 colorectal sites who then underwent cancer resection surgery. There are 1005 instances (patients), each being the record of a single patient with 14 features (attributes), including the target label. Of the 1005 instances, 760 are patients with primary CRC and no metastasis, representing the majority samples, while the other 245 are patients whose primary CRC has metastasized to other organs of the body, representing the minority samples. The data are categorical (grouped into multiple categories) and mapped to numeric values. Table 1 shows the features of the colorectal cancer dataset.

4.1.2. PIMA Indians Dataset

The PIMA Indians dataset was taken from the UCI machine learning repository [31]. It has nine features, including the class feature, which indicates whether a patient has diabetes or not. The dataset has 768 samples, including 268 with diabetes (the minority samples) and 500 without diabetes (the majority samples). The feature information is shown in Table 2.

4.1.3. Thoracic Surgery Dataset (THS)

The thoracic data was taken from the UCI machine learning repository [31]. It was collected from patients who underwent tumour resections for primary lung cancer. The dataset has 17 features, including the class feature. It has 470 samples, including 70 patients who died within one year after surgery (the minority samples) and 400 who survived (the majority samples). The feature information is shown in Table 3.

4.1.4. Breast Cancer (BC) Dataset

The BC dataset was taken from the UCI machine learning repository [31] and is provided by the Oncology Institute. It has ten features, including the class feature, which indicates whether breast cancer recurred or not. The dataset has 286 samples, including 85 cases of the minority class and 201 cases of the majority class. The feature information is shown in Table 4.

4.2. Classification Data Mining Algorithms

In this study, three classification algorithms with different characteristics were explored: decision tree (DT), Support Vector Machine (SVM), and k-Nearest Neighbour (KNN). The primary purpose of using these classifiers was to evaluate the performance of the proposed model on the four imbalanced medical datasets. The experiment was first run on the CRC dataset and then on the three datasets selected from the UCI repository.

4.2.1. k-Nearest Neighbour (KNN)

KNN is a classification technique that relies on feature similarity measures to find the closest neighbours. To classify a new point, KNN treats each training sample as a tuple of feature values with a label denoting its class, computes the distances between the new point and all training tuples, and then assigns the new point to the most frequent class among its k nearest tuples [32].

4.2.2. Support Vector Machine (SVM)

SVM is a supervised kernel-based classification algorithm that can be used for binary classification problems. It uses a mathematical function to define an optimal hyperplane that splits the two classes of a training dataset with a maximum margin. SVM maximizes the margin between the closest training data points (the support vectors) and the class boundary; in finding the optimal hyperplane, it effectively disregards the remaining, less significant training data. When the data is intrinsically nonlinear, SVM uses a kernel function to transform the data from the original dimension into a higher-dimensional space in which a separating hyperplane can be constructed. Popular kernel functions are the linear, polynomial, sigmoid, and Gaussian kernels [33, 34].

4.2.3. Decision Tree (DT)

A decision tree classifier combines several simpler decisions to build a tree model. The tree contains three types of nodes: root, internal, and leaf. The root is the starting point and has no incoming edges, only outgoing edges. Internal nodes represent data attributes; each has exactly one incoming branch and at least two outgoing branches, one for each possible attribute value. Leaf nodes represent the classes. The paths of the decision tree express sets of if-then rules that can be employed to classify new samples [35].
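As a brief illustration of how these three classifiers can be set up with scikit-learn (the paper does not report its exact hyperparameters, so the defaults below, the RBF kernel, and k = 5 are assumptions):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

classifiers = {
    "KNN": KNeighborsClassifier(n_neighbors=5),     # votes among the 5 nearest training tuples
    "SVM": SVC(kernel="rbf"),                       # maximum-margin hyperplane with a Gaussian kernel
    "DT": DecisionTreeClassifier(random_state=42),  # if-then rules induced from the attributes
}
```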

4.3. Evaluation Metrics

A classifier is typically evaluated with a confusion matrix, which contains four values from the classification output: the number of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN). TP refers to rare positive cases correctly recognized as positive. TN refers to frequent negative cases correctly recognized as negative. FP refers to frequent negative cases incorrectly recognized as rare positives, and FN refers to rare positive cases incorrectly recognized as frequent negatives. In the experiment, the minority class is the positive class and the majority class is the negative class. The most commonly used performance measure for classification tasks is accuracy. However, it is not an appropriate metric for evaluating imbalanced class distributions because a classifier with a strong bias towards the majority class can score highly while failing to classify the few minority class samples [36].

More appropriate metrics can be used to assess the performance of classifiers on imbalanced class distributions, such as sensitivity or recall (true positive rate (TPR)), specificity (true negative rate (TNR)) [37], precision (positive predictive value (PPV)) [32], F1 measure [35], and balanced accuracy (BACC) [38].

These metrics are given by the equations in (2) as follows:

$$\begin{aligned}
\text{Sensitivity (TPR)} &= \frac{TP}{TP + FN}, \qquad
\text{Specificity (TNR)} = \frac{TN}{TN + FP}, \qquad
\text{Precision (PPV)} = \frac{TP}{TP + FP}, \\[4pt]
F1 &= \frac{2 \times \text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}}, \qquad
\text{BACC} = \frac{\text{Sensitivity} + \text{Specificity}}{2}.
\end{aligned} \tag{2}$$
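A small helper that computes these quantities directly from the confusion matrix counts is sketched below (the function and argument names are illustrative, and zero-division guards are omitted for brevity):

```python
def imbalance_metrics(tp, tn, fp, fn):
    """Compute the metrics in (2) from confusion matrix counts."""
    sensitivity = tp / (tp + fn)   # recall / TPR
    specificity = tn / (tn + fp)   # TNR
    precision = tp / (tp + fp)     # PPV
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    bacc = (sensitivity + specificity) / 2
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "f1": f1, "bacc": bacc}
```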

To ensure an unbiased evaluation of the models, k-fold cross-validation is used as the evaluation criterion. In k-fold cross-validation, the data are divided into k equal folds; the model is then trained on all folds except one, which is held out as a validation set on which the trained model is tested. The process repeats so that each fold gets an opportunity to act as the test set, and the k test outcomes are averaged [35]. In our work, k is set to 5.
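Sketched below is one way such a 5-fold evaluation can be wired up with scikit-learn. The use of StratifiedKFold (to preserve the class ratio in every fold), the shuffle, the random seed, and the toy data are assumptions, since the paper only states that k = 5.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced data standing in for one of the medical datasets
rng = np.random.default_rng(0)
X = rng.random((100, 8))
y = np.array([0] * 80 + [1] * 20)   # 0 = majority (healthy), 1 = minority (disease)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in skf.split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # Undersample only the training fold (e.g., with HDUS), fit the classifier
    # on the resampled data, and evaluate on the untouched test fold.
```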

4.4. Comparative Methods

To allow a fair evaluation of the validity of our proposed method, HDUS is compared against three other undersampling methods:
(i) Tomek link (Tml): it removes noise and border points from the majority class by examining pairs of samples that belong to different classes but are each other's nearest neighbours and eliminating the majority sample of each pair [19].
(ii) Random undersampling (RUS): it eliminates instances from the majority class randomly until the desired balance of the class distribution is achieved [14].
(iii) Edited nearest neighbour (ENN): its basic idea is to eliminate majority class samples based on their k nearest neighbours belonging to the minority class. If a majority instance's neighbours are predominantly minority instances, that majority instance is eliminated as an overlapping instance [20].
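These three baselines are available, for example, in the imbalanced-learn package; the sketch below shows how they could be applied to a training split. The paper does not state which implementation was used, so this is an assumption, and the toy X_train, y_train arrays merely stand in for a training fold.

```python
import numpy as np
from imblearn.under_sampling import TomekLinks, RandomUnderSampler, EditedNearestNeighbours

# Toy training split: 80 majority (0) vs. 20 minority (1) samples
rng = np.random.default_rng(0)
X_train = rng.random((100, 8))
y_train = np.array([0] * 80 + [1] * 20)

samplers = {
    "Tml": TomekLinks(),                         # drops the majority member of each Tomek-link pair
    "RUS": RandomUnderSampler(random_state=42),  # random removal until the classes are balanced
    "ENN": EditedNearestNeighbours(),            # drops majority samples contradicted by their neighbours
}
for name, sampler in samplers.items():
    X_res, y_res = sampler.fit_resample(X_train, y_train)
    print(name, dict(zip(*np.unique(y_res, return_counts=True))))  # resampled class sizes
```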

5. Results Analysis

To investigate the performance of the proposed HDUS method, we used four imbalanced medical datasets and three classification algorithms (DT, SVM, and KNN), and compared HDUS with the baseline model (without any resampling method) and with three state-of-the-art undersampling methods (Tomek link, RUS, and ENN). The results on the four datasets (CRC, PIMA, THS, and BC) are shown in Tables 5–8, respectively, in terms of sensitivity, specificity, precision, F1 measure, and balanced accuracy.

As shown in Tables 5–8, the first column (the baseline model) confirms that the imbalanced classification problem exists in all of the datasets used. The baseline shows a low average sensitivity for predicting the minority class, ranging from 7.94% on the THS dataset to 39.7% on the PIMA dataset, while it attains a high average specificity for predicting the instances of the majority class.

The 2nd, 3rd, and 4th columns of the tables present the results of the undersampling methods Tomek link, RUS, and ENN, respectively. These methods yield a clear improvement, as expressed by the sensitivity values, which reflect the ability of the models to detect the class of interest, i.e., the minority class. Although the Tomek link obtained the weakest performance among the undersampling methods on all datasets, it is still better than the baseline model, except on the THS dataset, where it scored lower than the baseline.

Further improvement is achieved in the 5th column of all tables by the proposed HDUS method in terms of sensitivity, F1_m, and Bacc. HDUS shows a significant improvement over the baseline and the three undersampling methods. It attains the top sensitivity over all datasets (i.e., the highest ability to detect the class of interest, the minority class), scoring over 80% on both CRC and PIMA, near 70% on BC, and near 60% on THS, which is its lowest sensitivity. It also attains the highest rates of both F1_m and balanced accuracy.

6. Discussion

This study presented a preprocessing undersampling method named HDUS. The model handles the class imbalance problem in medical datasets to improve the prediction performance on minority class samples by using instance selection based on the Hellinger distance similarity measure.

It is crucial to address the issue of class imbalance by choosing appropriate approaches that handle skewed data distributions. As noted in the previous section, the baseline classification of the original datasets shows a very high specificity for predicting the majority class samples but a very poor sensitivity for predicting the minority class samples, which is the class of interest in imbalanced medical datasets. The traditional undersampling techniques, mainly RUS, show good progress. However, RUS is not entirely suitable since it eliminates meaningful samples randomly and can also cause overfitting because the removal of samples is performed without limitation [26]. The performance of ENN is lower than that of RUS except on the PIMA dataset, and Tml is the weakest method in the experiment. The proposed HDUS method outperformed all other methods in the experiment on all datasets, owing to the robustness of the Hellinger distance, which is skew-insensitive and not affected by class imbalance [30].

To simplify the comparison among the different undersampling methods used in the experiment and to evaluate their efficiency, Figures 1–3 provide a graphical representation of the average values of sensitivity, F1_m, and Bacc obtained by the five models on the four imbalanced medical datasets. As can be seen from the figures, the performance varies with the undersampling technique used. Figure 1 makes it evident that our HDUS method achieves good progress in predicting minority class samples on all datasets in terms of sensitivity. A similar situation can be seen in Figure 2 for F1_m, which is the trade-off between precision and recall, and in Figure 3 for Bacc, which is the trade-off between sensitivity and specificity.

Regarding the classification methods, it is worth remarking that the benefit of classification increases when the class imbalance issue is appropriately addressed. In our experiment, the different classification algorithms all benefit from the adoption of the HDUS model. In particular, DT achieves the best performance with HDUS; it also obtained the best results with Tml and ENN, whereas SVM is more appropriate for the RUS method. As shown in Table 9, the average results of the classifiers used with the experimented undersampling methods on the four datasets show an improvement in predicting the minority class samples in terms of sensitivity, F1_m, and Bacc.

Finally, the results of the proposed HDUS model should be considered preliminary, but they indicate a promising method for undersampling imbalanced medical datasets to improve the classification performance on minority class samples.

7. Conclusion

This paper proposed a novel model, HDUS, that handles the imbalanced classification problem in medical datasets to improve the classification of the minority disease class. HDUS reduces the majority class instances by using the Hellinger distance to calculate the similarity between each majority class instance and the minority class instances. HDUS then selects the subset of majority class instances with the highest similarity values, which is shown to perform well in combination with the original minority class instances. The experiment was conducted on four imbalanced medical datasets using three classifiers to compare HDUS with a baseline model and three selected undersampling models. The performance results show that HDUS achieves a significant improvement over the compared models in terms of sensitivity, which is highly desirable in the medical domain, F1 measure, and balanced accuracy. HDUS has proved to be a promising model for rebalancing imbalanced medical datasets, which contain few but important disease class cases.

In future work, we encourage comparing HDUS with other sampling techniques using the same or other classifiers, or evaluating it on a larger number of medical datasets with different characteristics. We also suggest integrating the proposed model with other sampling techniques to handle the imbalanced classification problem in medical datasets.

Data Availability

The Colorectal Cancer Dataset is not publicly available; it is from the Southampton University Hospital and has been used with approval from the responsible surgeon (co-author), and the data are all anonymous. The datasets “PIMA”, “Thoracic surgery”, and “Breast Cancer” are openly available at the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml.

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.