A systematic review of machine learning-based missing value imputation techniques,Data Technologies and Applications


A systematic review of machine learning-based missing value imputation techniques
Data Technologies and Applications (IF 1.6), Pub Date: 2021-04-02, DOI: 10.1108/dta-12-2020-0298
Tressy Thomas, Enayat Rajabi

Purpose

The primary aim of this study is to review studies along several dimensions, including the type of method, the experimentation setup and the evaluation metrics used in novel approaches proposed for data imputation, particularly in the machine learning (ML) area. This ultimately provides an understanding of how well the proposed frameworks are evaluated and what types and ratios of missingness they address. The review questions in this study are: (1) What ML-based imputation methods were studied and proposed during 2010–2020? (2) How are the experimentation setup, the characteristics of the data sets and the missingness employed in these studies? (3) What metrics were used to evaluate the imputation methods?

Design/methodology/approach

The review followed the standard identification, screening and selection process. The initial search of electronic databases for missing value imputation (MVI) based on ML algorithms returned a large number of papers, 2,883 in total. Most of the papers at this stage did not describe an MVI technique relevant to this study. The papers were first screened by title for relevance, and 306 were identified as appropriate. After the abstracts were reviewed, 151 papers that were not eligible for this study were dropped. This left 155 research papers suitable for full-text review, of which 117 were used to assess the review questions.
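The screening counts above can be checked with a few lines of arithmetic. The sketch below uses only the numbers reported in the abstract; the stage labels are illustrative wording, not the authors' terminology.

```python
# Counts reported in the abstract for each screening stage.
identified = 2883          # papers returned by the initial database search
title_relevant = 306       # kept after screening titles
dropped_on_abstract = 151  # removed after reading abstracts
full_text_review = title_relevant - dropped_on_abstract
included = 117             # used to assess the review questions

# Stage labels are illustrative, not taken from the paper.
stages = [
    ("identified by search", identified),
    ("relevant by title", title_relevant),
    ("full-text review", full_text_review),
    ("included in assessment", included),
]

# Print each stage with its share of the initial search result.
for name, count in stages:
    print(f"{name:<28}{count:>6}{count / identified:>9.1%}")
```

The subtraction reproduces the 155 papers that the abstract reports for full-text review.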

Findings

This study shows that clustering- and instance-based algorithms are the most frequently proposed MVI methods. Percentage of correct prediction (PCP) and root mean square error (RMSE) are the most used evaluation metrics in these studies. For experimentation, the majority of the studies sourced their data sets from publicly available data set repositories. A common approach is to take the complete data set as the baseline and evaluate the effectiveness of imputation on test data sets with artificially induced missingness. The data set size and missingness ratio varied across the experiments, while the missing data types and mechanisms addressed depend on the capability of the imputation method. Computational expense is a concern, and experimentation with large data sets appears to be a challenge.
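The evaluation approach described above (a complete data set as baseline, artificially induced missingness, then RMSE or PCP on the masked cells) might be sketched as follows. This is a minimal illustration, not the protocol of any reviewed study: the helper names, the toy data, the missing-completely-at-random masking and the column-mean imputer used as a stand-in are all assumptions.

```python
import math
import random

def induce_mcar(rows, ratio, seed=0):
    """Copy `rows` and set a fixed share of cells to None, chosen at random."""
    rng = random.Random(seed)
    cells = [(i, j) for i, row in enumerate(rows) for j in range(len(row))]
    masked = rng.sample(cells, int(round(ratio * len(cells))))
    damaged = [list(row) for row in rows]
    for i, j in masked:
        damaged[i][j] = None
    return damaged, masked

def mean_impute(damaged):
    """Placeholder imputer: fill each None with the mean of its column."""
    filled = [list(row) for row in damaged]
    for j in range(len(damaged[0])):
        observed = [row[j] for row in damaged if row[j] is not None]
        # Fall back to 0.0 if every value in the column was masked.
        mean = sum(observed) / len(observed) if observed else 0.0
        for row in filled:
            if row[j] is None:
                row[j] = mean
    return filled

def rmse(truth, imputed, masked):
    """Root mean square error over the masked cells (numeric data)."""
    squared = [(truth[i][j] - imputed[i][j]) ** 2 for i, j in masked]
    return math.sqrt(sum(squared) / len(squared))

def pcp(truth, imputed, masked):
    """Percentage of masked cells restored exactly (categorical data)."""
    hits = sum(1 for i, j in masked if truth[i][j] == imputed[i][j])
    return 100.0 * hits / len(masked)

# Toy complete data set, used as the ground-truth baseline.
complete = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0], [5.0, 50.0]]
damaged, masked = induce_mcar(complete, ratio=0.2)
imputed = mean_impute(damaged)
score = rmse(complete, imputed, masked)
```

Scoring only the masked cells keeps the metric from being diluted by values that were never missing.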

Originality/value

The review indicates that there is no single universal solution to the missing data problem. Variants of ML approaches work well with missingness depending on the characteristics of the data set. Most of the methods reviewed lack generalization with regard to applicability. Another concern related to applicability is the complexity of formulating and implementing the algorithm. Imputation methods based on k-nearest neighbors (kNN) and clustering algorithms are simple and easy to implement, which makes them popular across various domains.
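To illustrate why kNN-based imputation is considered easy to implement, here is a minimal sketch in plain Python. It is one generic variant and not a method from any reviewed paper: the function name `knn_impute`, the distance over jointly observed features and the unweighted neighbor average are assumptions made for this example.

```python
import math

def knn_impute(rows, k=2):
    """Fill each None cell with the mean of that column over the k nearest donor rows."""
    filled = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for other_i, other in enumerate(rows):
                # A donor must be a different row that observes column j.
                if other_i == i or other[j] is None:
                    continue
                # Compare only features observed in both rows.
                shared = [c for c in range(len(row))
                          if row[c] is not None and other[c] is not None]
                if not shared:
                    continue
                dist = math.sqrt(
                    sum((row[c] - other[c]) ** 2 for c in shared) / len(shared)
                )
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda pair: pair[0])
            nearest = candidates[:k]
            if nearest:  # leave the cell as None if no donor exists
                filled[i][j] = sum(v for _, v in nearest) / len(nearest)
    return filled

# Toy data: the last row is missing its second feature.
data = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [2.1, None]]
result = knn_impute(data, k=1)
# result[3][1] is 20.0, copied from the closest row [2.0, 20.0]
```

The nested loops also show the computational expense noted in the findings: every missing cell is compared against every other row, so the cost grows quickly with data set size.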
