Class center-based firefly algorithm for handling missing data
Journal of Big Data (IF 8.6), Pub Date: 2021-02-23, DOI: 10.1186/s40537-021-00424-y
Heru Nugroho, Nugraha Priya Utama, Kridanto Surendro

Estimating missing data is an important part of the data cleaning stage, and studies have shown that improper data handling leads to inaccurate analysis. Most studies, however, treat missing data without regard to the correlation between attributes. An adaptive search procedure helps to determine estimates of the missing data when correlations between attributes are considered in the process. The firefly algorithm (FA) implements such a procedure for imputation by determining the estimated value closest to the other values. This study therefore proposes a class center-based firefly algorithm (C3-FA), an adaptive approach that estimates missing data while considering attribute correlation in the imputation process. The results show that C3-FA is an efficient technique for recovering the actual values of missing data, with a Pearson correlation coefficient (r) close to 1 and a root mean squared error (RMSE) close to 0. The proposed method also maintains the true distribution of data values: in the Kolmogorov–Smirnov test, the statistic D_KS is close to 0 for most attributes in the dataset. Finally, an accuracy evaluation using three classifiers shows that the proposed method produces good accuracy.
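
The three evaluation measures named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code: the function names and the sample values are hypothetical, and the Kolmogorov–Smirnov statistic is assumed here to be the usual two-sample form (the largest gap between two empirical distribution functions), which the abstract does not spell out.

```python
import bisect
import math


def rmse(actual, imputed):
    """Root mean squared error between true and imputed values (0 is best)."""
    n = len(actual)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(actual, imputed)) / n)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: maximum gap between the empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    d = 0.0
    # The gap can only change at observed values, so check each one.
    for point in sorted(set(a) | set(b)):
        cdf_a = bisect.bisect_right(a, point) / len(a)
        cdf_b = bisect.bisect_right(b, point) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


if __name__ == "__main__":
    # Hypothetical values for one attribute: the true entries that were
    # hidden, and the estimates an imputation method produced for them.
    actual = [5.1, 4.9, 6.3, 5.8, 7.0]
    imputed = [5.0, 5.0, 6.1, 5.9, 6.8]
    print("RMSE:", rmse(actual, imputed))
    print("Pearson r:", pearson_r(actual, imputed))
    print("D_KS:", ks_statistic(actual, imputed))
```

Under these definitions, an imputation that reproduces the hidden values exactly gives an RMSE of 0, r of 1, and D_KS of 0, which is the direction of the results the abstract reports.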




Updated: 2021-02-23