SICE: an improved missing data imputation technique


Journal of Big Data (IF 8.6), Pub Date: 2020-06-12, DOI: 10.1186/s40537-020-00313-w
Shahidul Islam Khan (1, 2), Abu Sayed Md Latiful Hoque (1)


In data analytics, missing data degrades performance, and incorrect imputation of missing values can lead to wrong predictions. In the era of big data, when a massive volume of data is generated every second and making use of that data is a major concern for stakeholders, handling missing values efficiently becomes even more important. In this paper, we propose a new technique for missing data imputation that is a hybrid of single and multiple imputation techniques. We propose an extension of the popular Multivariate Imputation by Chained Equations (MICE) algorithm in two variations, to impute categorical and numeric data. We also implemented twelve existing algorithms to impute binary, ordinal, and numeric missing values. We collected sixty-five thousand real health records from different hospitals and diagnostic centers in Bangladesh while maintaining data privacy, along with three public datasets from the UCI Machine Learning Repository, ETH Zurich, and Kaggle. We compared the performance of our proposed algorithms with the existing algorithms on these datasets. Experimental results show that our proposed algorithm achieves a 20% higher F-measure for binary data imputation and 11% lower error for numeric data imputation than its competitors, with similar execution time.
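For readers unfamiliar with chained-equation imputation, the general idea behind MICE can be sketched as follows. This is not the SICE algorithm from the paper, whose details the abstract does not give. It is a minimal, deterministic illustration for numeric data only: missing cells start from a single imputation (column means), and each incomplete column is then repeatedly regressed on the other columns to refine its filled-in values. Full MICE would instead draw imputations at random and produce several completed datasets. The function names (`solve`, `fit_ols`, `chained_impute`), the use of `None` for missing cells, and the small `ridge` term are all assumptions made for this sketch.

```python
from statistics import mean

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def fit_ols(xs, ys, ridge=1e-8):
    """Least-squares coefficients, intercept first.

    The tiny ridge term is an assumption of this sketch; it keeps the
    normal equations solvable when predictors are collinear.
    """
    rows = [[1.0] + list(x) for x in xs]
    d = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) + (ridge if i == j else 0.0)
            for j in range(d)] for i in range(d)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(d)]
    return solve(xtx, xty)

def chained_impute(data, n_iter=10):
    """Fill None entries by regressing each incomplete column on all the others."""
    n_cols = len(data[0])
    missing = [[v is None for v in row] for row in data]
    # Step 1: start from a simple single imputation (column means).
    col_means = [mean(row[j] for row in data if row[j] is not None)
                 for j in range(n_cols)]
    filled = [[col_means[j] if v is None else float(v) for j, v in enumerate(row)]
              for row in data]
    # Step 2: cycle through the incomplete columns, refining the filled values.
    for _ in range(n_iter):
        for j in range(n_cols):
            if not any(m[j] for m in missing):
                continue  # column j is fully observed
            def others(row):
                return [row[k] for k in range(n_cols) if k != j]
            # Train only on rows where column j was actually observed.
            train = [(others(r), r[j]) for r, m in zip(filled, missing) if not m[j]]
            beta = fit_ols([x for x, _ in train], [y for _, y in train])
            for r, m in zip(filled, missing):
                if m[j]:
                    r[j] = beta[0] + sum(b * x for b, x in zip(beta[1:], others(r)))
    return filled
```

Calling `chained_impute(rows)` on a list of equal-length rows returns a new list of float rows with every `None` replaced and the observed values left unchanged. Categorical columns would need a classifier in place of the linear regression used here.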


Updated: 2020-06-12
