
Applied Soft Computing

Volume 99, February 2021, 106905

Incomplete data ensemble classification using imputation-revision framework with local spatial neighborhood information

https://doi.org/10.1016/j.asoc.2020.106905

Highlights

  • An imputation-revision procedure is proposed to revise the imputed missing values.

  • It detects and uses the local spatial neighborhood information for data revision.

  • It integrates the advantages of various existing missing value imputation methods.

  • The framework achieves better performance by combining several simple imputations.

Abstract

Most existing machine learning techniques require complete data. However, incomplete patterns are common in many real-world scenarios due to missing values (MVs). Various missing value imputation (MVI) methods have been proposed to recover MVs, and each has its own advantages in certain scenarios. However, on the one hand, few of them consider taking advantage of different MVI methods; on the other hand, how to improve imputation performance with local information remains an open problem. This paper proposes an imputation-revision framework with local spatial neighborhood information for incomplete data classification. The proposed method endeavors to combine the advantages of several imputation methods. It first obtains several complete datasets that are pre-filled by various MVI methods. Then, it detects the local spatial neighborhood information (LSNI) of the samples and revises the MVs based on the LSNI. Finally, an ensemble technique is employed to make the final decision. Numerical experiments have verified the superiority of the proposed method in terms of both prediction accuracy and algorithm stability.

Introduction

Classification is one of the most important tasks in machine learning (ML) [1], [2]. In recent years, it has been widely applied in many fields, such as biometric recognition [3], text classification [4], medical diagnosis [5] and others [6], [7]. With the rapid development of technologies such as sensor and information technology, data acquisition is becoming increasingly diverse. Data plays an extremely important role in the current digital economy and brings great opportunities for the development of ML.

It is worth noting that the integrity of data is essential for most existing ML algorithms. However, in practice, the loss of information is inevitable for various reasons, such as damage to storage equipment, failed pixels, the limited capacity of data acquisition equipment, unanswered questions in surveys and so on. For example, 45% of the UCI datasets contain missing values (MVs) [8], and MVs are common in microarray data [9], [10], mobile phone data [11], visual data [12], industrial data [13], software project data [14], etc. Since most existing data analysis methods are designed for complete data, the existence of MVs renders them inapplicable.

Although a few algorithms such as C4.5 [15] can handle MVs directly, classification performance degrades greatly when the original dataset contains a large number of MVs [16], [17]. Many strategies have been proposed to address the problem of incomplete data classification. The simplest is case deletion [18], which ignores (or discards) the samples containing MVs and only considers the information carried by the complete samples in the original dataset. However, this strategy may leave too few samples for investigation because useful information is lost. In addition, many sophisticated methods have been proposed to improve incomplete data classification performance, such as the Robust Bayes Classifier (RBC) [19], rough-set based methods [20], [21], [22] and three-way decision based methods [23], [24], etc.

In addition to the above methods, MVI is a much more popular strategy for dealing with incomplete data, which replaces the MVs with estimated ones [25], [26], [27]. In the past decades, numerous MVI methods have been developed and have shown distinctive advantages in different scenarios; most of them are based on statistical knowledge and machine learning [28], [29], [30], [31], [32], [33]. These methods often share a similar idea of imputing MVs in incomplete tuples through their complete neighbors in the data space. For example, mean imputation (MEI) [34] replaces all missing values in each feature with the mean of the corresponding complete attribute; KNN-based imputation (KNNI) [35] selects the k samples most similar to the sample with MVs by calculating the Euclidean distance and then imputes each MV with the corresponding complete attribute values of those k samples. Besides, classical MVI methods such as median imputation (MEDI) [34], SOFT imputation (SOFTI) [36], matrix factorization imputation (MFI) [37] and variants of the above methods such as W-KNNI [38] have been proposed to handle MVs.
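As an illustration of these classical single-imputation baselines, the sketch below uses scikit-learn's standard SimpleImputer and KNNImputer (library implementations, not the code used in this paper); the toy matrix X is purely hypothetical.

```python
# Hedged sketch of classical single-value imputers; missing entries are np.nan.
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 6.0],
              [5.0, 4.0, 9.0],
              [7.0, 8.0, 12.0]])

X_mei = SimpleImputer(strategy="mean").fit_transform(X)     # MEI: column mean
X_medi = SimpleImputer(strategy="median").fit_transform(X)  # MEDI: column median
X_knni = KNNImputer(n_neighbors=2).fit_transform(X)         # KNNI: average of the 2 nearest neighbors
```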

However, these methods are all single imputation methods that generate a single value for each missing entry. Rubin et al. [39] proposed the idea of multiple imputation. Different from single imputation, multiple imputation creates several complete datasets by imputing different values; those datasets are analyzed separately and then combined into a final result. MICE [40] is one of the typical multiple imputation methods. Research shows that the classification accuracy on an incomplete dataset can be improved when MVI methods are used to complete the MVs before classification [41]. However, existing MVI methods seldom consider the local spatial neighborhood information, which has been proved effective in many application scenarios [42], [43], [44], [45]. In addition, few of them can combine the strengths of various MVI methods. How to integrate the advantages of various MVI methods and revise MVs according to the local spatial neighborhood information deserves investigation.
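For illustration only, the multiple-imputation idea can be sketched with scikit-learn's IterativeImputer, whose chained-equations design is MICE-like; this is an assumption-laden stand-in, not the MICE implementation referenced above.

```python
# Draw several plausible completions by re-running a chained-equations imputer
# with different random seeds (MICE-style multiple imputation, sketched).
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

def multiple_impute(X, n_imputations=5):
    """Return a list of complete copies of X, one per imputation draw."""
    return [
        IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X)
        for seed in range(n_imputations)
    ]
```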

Ensemble learning is a powerful technique for improving learning performance that constructs multiple classifiers instead of a single classifier [46]. Studies on using ensemble learning to deal with incomplete data can be roughly divided into two categories: (1) ensemble classification methods for incomplete data, which construct a set of classifiers from a group of data subsets without any imputation [47], [48], [49], [50]; and (2) methods that apply an imputation strategy (single or multiple imputation) to the incomplete data and then conduct ensemble learning on the completed data. The former faces a fatal problem: most existing ensemble classification methods do not perform well when the original dataset contains a large number of MVs [47], [48]. In the second strategy, techniques such as feature selection and random sampling are used to obtain a set of different complete datasets in the training process, and a group of classifiers is then learned from these datasets [51].
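The second strategy can be illustrated with a minimal, hypothetical sketch: one base classifier is trained per completed dataset and their predictions are combined by majority vote. The choice of decision trees and plain voting is an illustrative assumption, not the configuration used in [51].

```python
# One base classifier per completed training set; majority vote at prediction time.
# Assumes non-negative integer class labels.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def ensemble_fit_predict(train_sets, y_train, test_sets):
    """train_sets / test_sets: lists of complete feature matrices, one per imputation."""
    votes = []
    for X_tr, X_te in zip(train_sets, test_sets):
        clf = DecisionTreeClassifier().fit(X_tr, y_train)
        votes.append(clf.predict(X_te))
    votes = np.stack(votes)  # shape: (n_imputations, n_test_samples)
    # majority vote over the base classifiers for each test sample
    return np.apply_along_axis(lambda v: np.bincount(v.astype(int)).argmax(), 0, votes)
```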

In this paper, the proposed framework advocates integrating the advantages of different MVI methods, which have been proven to have distinctive strengths in different scenarios. To be specific, it selects several classical MVI methods to pre-fill the original incomplete dataset, thus obtaining multiple complete datasets. Unlike most established imputation methods, the proposed framework first imputes the incomplete dataset with several imputation strategies to obtain a group of complete datasets, then revises the missing values in each complete dataset separately according to the local spatial neighborhood information (LSNI), and finally conducts ensemble classification on the revised datasets.
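A schematic outline of this pipeline, as described above, is sketched below. The LSNI revision rule shown here (replacing each originally missing cell by the average over the sample's nearest neighbors in the pre-filled data) is only an illustrative placeholder; the paper's actual revision formula is given in Section 3.

```python
# Schematic imputation-revision pipeline: pre-fill with several MVI methods,
# then revise each completed copy using local neighborhood information.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def revise_with_local_neighbors(X_filled, missing_mask, k=5):
    """Revise pre-filled entries using each sample's local neighborhood (placeholder rule)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_filled)
    _, idx = nn.kneighbors(X_filled)                  # idx[:, 0] is the sample itself
    X_rev = X_filled.copy()
    for i, j in zip(*np.where(missing_mask)):
        X_rev[i, j] = X_filled[idx[i, 1:], j].mean()  # neighborhood average
    return X_rev

def imputation_revision(X_incomplete, imputers, k=5):
    """Pre-fill with several MVI methods, then revise each completed copy."""
    mask = np.isnan(X_incomplete)                     # missing entries encoded as NaN
    revised = []
    for imputer in imputers:                          # e.g. MEI, MEDI, KNNI imputer objects
        X_filled = imputer.fit_transform(X_incomplete)
        revised.append(revise_with_local_neighbors(X_filled, mask, k))
    return revised                                    # one revised complete dataset per MVI method
```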

The rest of this paper is organized as follows. Section 2 briefly reviews related work. The proposed method is presented in Section 3. Section 4 presents numerical experiments that validate the effectiveness of the proposed method. Section 5 concludes the paper.

Section snippets

The study of the effectiveness of MVI techniques

Acuna and Rodriguez [52] compared four methods for dealing with the problem of incomplete data classification: case deletion and three MVI methods (mean imputation, median imputation and KNN-based imputation); linear discriminant analysis (LDA) and KNN were then used to classify the incomplete data. Batista and Monard [53] tested the classification accuracy of two famous classifiers (C4.5 [15] and CN2 [54]) before imputation. Results showed that KNN-based imputation achieved the

The proposed method

To solve the problem of incomplete data classification effectively, this paper proposes a new method for incomplete data classification that embeds an imputation-revision framework to revise the results of existing simple MVI methods before classification. It integrates the strengths of various simple MVI methods. This paper mainly focuses on how the framework improves the classification accuracy of classical MVI methods.

Let $X=\{x_1,x_2,\dots,x_m\}$ be an incomplete dataset; $n$ denotes the number

Evaluation criterion

For incomplete datasets, one of the most common evaluation criteria is NRMSE, which measures imputation performance. In addition, the learning result of an incomplete data classification algorithm can be expressed by the classification accuracy. NRMSE and Accuracy are defined as follows:

(a) NRMSE
$$\mathrm{NRMSE}=\frac{\sqrt{\operatorname{mean}\left[(x_{imp}-x_{ori})^{2}\right]}}{\operatorname{std}(x_{ori})}$$
where $x_{imp}$ and $x_{ori}$ are the imputed values and the known answer values presented in the original dataset, respectively. What is more, $\operatorname{std}(x_{ori})$ denotes the
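For reference, a minimal sketch of these two criteria in NumPy (assuming the imputed values, the known original values, and the class labels are supplied as arrays; not the authors' code):

```python
import numpy as np

def nrmse(x_imp, x_ori):
    """Root mean squared imputation error normalized by the std of the true values."""
    return np.sqrt(np.mean((x_imp - x_ori) ** 2)) / np.std(x_ori)

def accuracy(y_pred, y_true):
    """Fraction of correctly classified samples."""
    return np.mean(np.asarray(y_pred) == np.asarray(y_true))
```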

Conclusions and discussions

This paper investigates the effect of local neighborhood information in MVI and further considers combining the distinctive strengths of various MVI methods so as to improve the performance of incomplete data classification. The proposed method contains a novel imputation-revision framework that revises the results obtained by several classical MVI methods with local neighborhood information, and then conducts ensemble learning on the obtained complete datasets

CRediT authorship contribution statement

Yuanting Yan: Conceived the study, Performed the experiments, Analyzed the data, Drafted the paper, Reviewed the manuscript. Yaya Wu: Performed the experiments, Analyzed the data, Reviewed the manuscript. Xiuquan Du: Provided suggestions for the writing of the paper, Reviewed the manuscript. Yanping Zhang: Provided suggestions for the writing of the paper, Reviewed the manuscript.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Nos. 61806002, 61872002 and 61673020) and the Doctoral Scientific Research Start-Up Foundation of Anhui University. All authors approved the version of the manuscript to be published.

References (63)

  • J. Han, et al., Data Mining: Concepts and Techniques (2011)
  • R.O. Duda, et al., Pattern Classification (2012)
  • P. Larranaga, et al., Machine learning in bioinformatics, Brief. Bioinform. (2006)
  • F. Sebastiani, Machine learning in automated text categorization, ACM Comput. Surv. (2002)
  • F. Xiao, et al., A computational model for heart failure stratification
  • A.S. Fialho, et al., Probabilistic fuzzy prediction of mortality in intensive care units
  • M. Lichman, UCI Machine Learning Repository (2013)
  • T. Aittokallio, Dealing with missing values in large-scale studies: microarray data imputation and beyond, Brief. Bioinform. (2009)
  • M.C. De Souto, et al., Impact of missing data imputation methods on gene expression clustering and classification, BMC Bioinformatics (2015)
  • J. Liu, et al., Tensor completion for estimating missing values in visual data, IEEE Trans. Pattern Anal. Mach. Intell. (2012)
  • K. Lakshminarayan, et al., Imputation of missing data in industrial databases, Appl. Intell. (1999)
  • J.R. Quinlan, C4.5: Programs for Machine Learning (2014)
  • P.J. García-Laencina, et al., Pattern classification with missing data: a review, Neural Comput. Appl. (2010)
  • C.K. Enders, Applied Missing Data Analysis (2010)
  • P.D. Allison, Missing Data, Vol. 136 (2001)
  • H. Zhao, K. Qin, Mixed feature selection in incomplete decision table, Knowl.-Based Syst....
  • R.K. Nowicki, et al., Application of rough sets in k nearest neighbours algorithm for classification of incomplete samples
  • C. Luo, T. Li, Y. Huang, H. Fujita, Updating three-way decisions in incomplete multi-scale information systems, Inf....
  • A novel three-way decision model based on incomplete information...
  • R.J. Little, et al., Statistical Analysis with Missing Data, Vol. 793 (2019)
  • J. Luengo, et al., On the choice of the best imputation methods for missing values considering three groups of classification methods, Knowl. Inf. Syst. (2012)