Incomplete data ensemble classification using imputation-revision framework with local spatial neighborhood information
Introduction
Classification is one of the most important tasks in machine learning (ML) [1], [2]. In recent years, it has been widely applied in fields such as biometric recognition [3], text classification [4] and medical diagnosis [5], among others [6], [7]. With the rapid development of sensor and information technologies, data acquisition is becoming increasingly diverse. Data plays an extremely important role in the current digital economy and brings great opportunities for the development of ML.
It is worth noting that data integrity is essential for most existing ML algorithms. In practice, however, loss of information is inevitable for reasons such as damaged storage equipment, failed pixels, limited capacity of data acquisition equipment and unanswered survey questions. For example, 45% of the UCI datasets contain missing values (MVs) [8], and MVs are common in microarray data [9], [10], mobile phone data [11], visual data [12], industrial data [13] and software project data [14]. Since most existing data analysis methods are designed for complete data, the presence of MVs makes them inapplicable.
Although a few algorithms such as C4.5 [15] can handle MVs directly, their classification performance decreases greatly when the original dataset contains a large number of MVs [16], [17]. Many strategies have been proposed to solve the problem of incomplete data classification. The simplest strategy is case deletion [18], which discards the samples that contain MVs and uses only the complete samples in the original dataset. However, this strategy discards useful information and may leave too few samples for analysis. In addition, many sophisticated methods have been proposed to improve incomplete data classification performance, such as the Robust Bayes Classifier (RBC) [19], rough-set based methods [20], [21], [22] and three-way decision based methods [23], [24].
In addition to the above methods, missing value imputation (MVI) is a much more popular strategy for dealing with incomplete data; it replaces the MVs with estimated values [25], [26], [27]. In the past decades, numerous MVI methods have been developed, each showing distinctive strengths in different scenarios. Most of them are based on statistics or machine learning [28], [29], [30], [31], [32], [33]. These methods often share the idea of imputing the MVs in incomplete tuples from their complete neighbors in the data space. For example, mean imputation (MEI) [34] replaces all missing values in each feature with the mean of the observed values of that feature; KNN-based imputation (KNNI) [35] uses the Euclidean distance to select the k samples most similar to a sample with MVs and then imputes each MV from the corresponding observed attribute values of those k samples. Other classical MVI methods, such as median imputation (MEDI) [34], SOFT imputation (SOFTI) [36] and matrix factorization imputation (MFI) [37], as well as variants of the above methods such as W-KNNI [38], have also been proposed to handle MVs.
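The two single-imputation ideas described above can be illustrated with a minimal sketch. It is not taken from the cited works: the `MISSING` marker, the function names, the rule of measuring distance only over features observed in both rows, and the choice to average the neighbors' values are all assumptions made for illustration. Plain Python is used to keep the example dependency-free.

```python
import math

MISSING = None  # marker for a missing value (an assumption of this sketch)


def mean_impute(rows):
    """Replace each missing entry with the mean of the observed values in its column."""
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not MISSING]
        col_means.append(sum(observed) / len(observed))
    return [[col_means[j] if v is MISSING else v for j, v in enumerate(r)]
            for r in rows]


def _distance(a, b):
    """Euclidean distance over the features observed in both rows (assumed rule)."""
    shared = [(x, y) for x, y in zip(a, b)
              if x is not MISSING and y is not MISSING]
    if not shared:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in shared))


def knn_impute(rows, k=2):
    """Fill each missing entry with the mean of that feature over the k nearest donors."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, v in enumerate(row):
            if v is not MISSING:
                continue
            # Donors are the other rows that observe feature j, ranked by distance.
            donors = sorted(
                ((_distance(row, other), other[j])
                 for t, other in enumerate(rows)
                 if t != i and other[j] is not MISSING),
                key=lambda pair: pair[0],
            )[:k]
            if donors:  # leave the entry missing if no donor observes feature j
                filled[i][j] = sum(val for _, val in donors) / len(donors)
    return filled


data = [
    [1.0, 2.0],
    [2.0, MISSING],
    [3.0, 6.0],
    [10.0, 20.0],
]
print(mean_impute(data)[1])
print(knn_impute(data, k=2)[1])
```

On this toy dataset the outlying fourth row pulls the column mean upward, whereas the KNN variant draws only on the two rows closest to the incomplete one, which is the kind of locality the neighbor-based methods rely on.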
However, these methods are all single imputation methods, which generate a single value for each missing value. Rubin et al. [39] proposed the idea of multiple imputation. Unlike single imputation, multiple imputation creates several complete datasets by imputing different values; these datasets are analyzed separately and the analyses are then combined to produce a final result. MICE [40] is a typical multiple imputation method. Research shows that the classification accuracy on an incomplete dataset can be improved when MVI methods are used to fill in the MVs before classification [41]. However, existing MVI methods seldom consider local spatial neighborhood information, which has proved effective in many application scenarios [42], [43], [44], [45]. In addition, few of them combine the strengths of several MVI methods. How to integrate the advantages of various MVI methods and revise the imputed values according to local spatial neighborhood information therefore deserves investigation.
Ensemble learning is a powerful technique for improving learning performance; it constructs multiple classifiers instead of a single classifier for classification [46]. Studies that use ensemble learning to deal with incomplete data can be roughly divided into two categories: (1) ensemble classification methods for incomplete data, which construct a set of classifiers from a group of data subsets without any imputation [47], [48], [49], [50]; (2) methods that apply an imputation strategy (single or multiple imputation) to the incomplete data and then conduct ensemble learning on the complete dataset. The former faces a serious problem: most existing ensemble classification methods do not perform well when the original dataset contains a large number of MVs [47], [48]. For the second strategy, techniques such as feature selection and random sampling are used to obtain a set of different complete datasets in the training process, and a group of classifiers is then learned from these datasets [51].
In this paper, the proposed framework integrates the advantages of different MVI methods, each of which has been shown to have distinctive strengths in different scenarios. Specifically, it first selects several classical MVI methods to pre-fill the original incomplete dataset, so that multiple complete datasets are obtained. Unlike most established imputation methods, it then revises the imputed values in each complete dataset separately according to the local spatial neighborhood information (LSNI). Finally, it conducts ensemble classification on the revised datasets.
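The three stages just outlined (pre-fill with several MVI methods, revise the imputed entries, classify by ensemble) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' algorithm: the revision rule actually used by the framework is not given in this excerpt, so `revise` uses a placeholder rule (the mean of the feature over the k nearest rows of the pre-filled data). The choice of mean and median pre-fills, the 1-NN base classifier, the majority vote and all function names are likewise hypothetical.

```python
import math
import statistics
from collections import Counter

MISSING = None  # missing-value marker (an assumption of this sketch)


def prefill(rows, stat):
    """Pre-fill missing entries column by column with stat(observed values)."""
    cols = list(zip(*rows))
    fills = [stat([v for v in col if v is not MISSING]) for col in cols]
    return [[fills[j] if v is MISSING else v for j, v in enumerate(r)]
            for r in rows]


def revise(filled, mask, k=2):
    """Placeholder revision step (NOT the paper's rule): reset each imputed entry
    to the mean of that feature over the k nearest rows of the pre-filled data."""
    revised = [list(r) for r in filled]
    for i, row in enumerate(filled):
        if not any(mask[i]):
            continue
        others = [t for t in range(len(filled)) if t != i]
        nearest = sorted(others, key=lambda t: math.dist(row, filled[t]))[:k]
        for j, was_missing in enumerate(mask[i]):
            if was_missing:
                revised[i][j] = statistics.fmean([filled[t][j] for t in nearest])
    return revised


def predict_1nn(train_x, train_y, query):
    """Return the label of the training row closest to the query."""
    best = min(range(len(train_x)), key=lambda t: math.dist(train_x[t], query))
    return train_y[best]


def ensemble_predict(rows, labels, query, k=2):
    """Majority vote over one 1-NN classifier per revised dataset."""
    mask = [[v is MISSING for v in r] for r in rows]
    votes = []
    for stat in (statistics.fmean, statistics.median):  # two classical pre-fills
        revised = revise(prefill(rows, stat), mask, k)
        votes.append(predict_1nn(revised, labels, query))
    return Counter(votes).most_common(1)[0][0]


rows = [[1.0, 1.0], [1.2, MISSING], [5.0, 5.0], [MISSING, 5.2]]
labels = ["a", "a", "b", "b"]
print(ensemble_predict(rows, labels, [1.1, 0.9]))
print(ensemble_predict(rows, labels, [5.1, 5.0]))
```

Keeping the missing-value mask separate from the pre-filled data lets the revision step touch only the entries that were imputed, leaving observed values unchanged.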
The rest of this paper is organized as follows. Section 2 briefly reviews related work. Section 3 presents the proposed method. Section 4 reports numerical experiments that validate its effectiveness. Section 5 concludes the paper.
The study of the effectiveness of MVI techniques
Acuna and Rodriguez [52] compared four methods for dealing with incomplete data classification, namely case deletion and three MVI methods (mean imputation, median imputation and KNN-based imputation); linear discriminant analysis (LDA) and KNN were then used to classify the incomplete data. Batista and Monard [53] tested the classification accuracy of two well-known classifiers (C4.5 [15] and CN2 [54]) before imputation. Results showed that KNN-based imputation achieved the
The proposed method
To solve the problem of incomplete data classification effectively, this paper proposes a new method that embeds an imputation-revision framework to revise the results of existing simple MVI methods before classification. It integrates the strengths of various simple MVI methods. The paper focuses mainly on how the framework improves the classification accuracy obtained with classical MVI methods.
Let be an incomplete dataset; n denotes the number
Evaluation criterion
For incomplete datasets, one of the most common evaluation criteria is the normalized root mean square error (NRMSE), which measures imputation performance. In addition, the learning results of an incomplete data classification algorithm can be expressed by the classification accuracy. The definitions of NRMSE and Accuracy are as follows:
(a) NRMSE where and are imputed values and known answer values presented in the original dataset, respectively. What is more, denotes the
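Since the formulas themselves are not reproduced in this excerpt, the following is a minimal sketch of how the two criteria are commonly computed. It assumes one widespread normalization for NRMSE, namely dividing the mean squared error by the variance of the known values before taking the square root; the paper's exact normalization may differ. The function names are illustrative.

```python
import math
import statistics


def nrmse(imputed, truth):
    """sqrt(mean((imputed - truth)^2) / variance(truth)) over the imputed entries.

    The variance normalization is an assumption of this sketch.
    """
    mse = statistics.fmean([(a - b) ** 2 for a, b in zip(imputed, truth)])
    return math.sqrt(mse / statistics.pvariance(truth))


def accuracy(predicted, actual):
    """Fraction of samples whose predicted label equals the true label."""
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)


print(nrmse([1.0, 1.0], [0.0, 2.0]))
print(accuracy(["a", "b", "a", "b"], ["a", "b", "b", "b"]))
```

A lower NRMSE indicates imputed values closer to the known answers, while a higher accuracy indicates better downstream classification; the two need not move together, which is why both are reported.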
Conclusions and discussions
This paper investigates the effect of local neighborhood information in MVI and further considers how to combine the distinctive strengths of various MVI methods so as to improve the performance of incomplete data classification. The proposed method contains a novel imputation-revision framework that revises the results obtained by several classical MVI methods using local neighborhood information, and then conducts ensemble learning on the obtained complete datasets
CRediT authorship contribution statement
Yuanting Yan: Conceived the study, Performed the experiments, Analyzed the data, Drafted the paper, Reviewed the manuscript. Yaya Wu: Performed the experiments, Analyzed the data, Reviewed the manuscript. Xiuquan Du: Provided suggestions for the writing of the paper, Reviewed the manuscript. Yanping Zhang: Provided suggestions for the writing of the paper, Reviewed the manuscript.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (Nos. 61806002, 61872002 and 61673020) and the Doctoral Scientific Research Start-Up Foundation of Anhui University. All authors approved the version of the manuscript to be published.
References (63)
- et al., Recent advances in visual and infrared face recognition: a review, Comput. Vis. Image Underst. (2005)
- et al., Anomaly detection from incomplete data, ACM Trans. Knowl. Discov. Data (2014)
- et al., Can k-NN imputation improve the performance of C4.5 with small software project data sets? A comparative evaluation, J. Syst. Softw. (2008)
- et al., Robust Bayes classifiers, Artificial Intelligence (2001)
- Rough set approach to incomplete information systems, Inf. Sci. (1998)
- et al., A gentle introduction to imputation of missing values, J. Clin. Epidemiol. (2006)
- et al., An imputation-based matrix factorization method for improving accuracy of collaborative filtering systems, Eng. Appl. Artif. Intell. (2015)
- et al., Impact of imputation of missing values on classification error for discrete data, Pattern Recognit. (2008)
- et al., Learn++.MF: A random subspace approach for the missing feature problem, Pattern Recognit. (2010)
- et al., Comparison of five iterative imputation methods for multivariate classification, Chemometr. Intell. Lab. Syst. (2013)