Combining data discretization and missing value imputation for incomplete medical datasets.
PLOS ONE (IF 3.7), Pub Date: 2023-11-30, DOI: 10.1371/journal.pone.0295032
Min-Wei Huang, Chih-Fong Tsai, Shu-Ching Tsui, Wei-Chao Lin

Data discretization aims to transform a set of continuous features into discrete features, simplifying the representation of information and making it easier to understand, use, and explain. In practice, users can take advantage of discretization to improve knowledge discovery and data analysis on medical domain datasets containing continuous features. However, some feature values are frequently missing, and many data-mining algorithms cannot handle incomplete datasets. In this study, we consider the combined use of discretization and missing-value imputation to process incomplete medical datasets, examining how the order in which discretization and imputation are applied influences performance. Experiments were conducted on seven medical domain datasets using two discretizers, the minimum description length principle (MDLP) and ChiMerge; three imputation methods, mean/mode, classification and regression trees (CART), and k-nearest neighbors (KNN); and two classifiers, support vector machines (SVM) and the C4.5 decision tree. The results show that better performance is obtained by performing discretization first and imputation second, rather than vice versa. Furthermore, the highest classification accuracy was achieved by combining ChiMerge and KNN with SVM.
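
The following is a minimal sketch of the "discretize first, then impute" ordering the study recommends. The paper uses MDLP and ChiMerge discretizers, neither of which ships with scikit-learn, so this example substitutes simple equal-frequency binning (with bin edges estimated from observed values only) purely to illustrate the pipeline order; the dataset, bin count, and neighbor count are hypothetical.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical incomplete dataset: 200 samples, 5 continuous features,
# with roughly 10% of entries missing at random.
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.10] = np.nan

def discretize(X, n_bins=4):
    """Equal-frequency binning per feature; missing cells stay NaN."""
    X_disc = np.full_like(X, np.nan)
    for j in range(X.shape[1]):
        col = X[:, j]
        # Bin edges are computed from observed (non-missing) values only.
        edges = np.nanquantile(col, np.linspace(0, 1, n_bins + 1))[1:-1]
        observed = ~np.isnan(col)
        X_disc[observed, j] = np.digitize(col[observed], edges)
    return X_disc

# Step 1: discretization (missing values are left untouched).
X_disc = discretize(X)

# Step 2: impute the discretized data with KNN, one of the three
# imputation methods compared in the study.
X_imputed = KNNImputer(n_neighbors=5).fit_transform(X_disc)

# Step 3: classify with SVM, the better-performing classifier reported.
X_tr, X_te, y_tr, y_te = train_test_split(X_imputed, y, random_state=0)
clf = SVC().fit(X_tr, y_tr)
print("accuracy:", clf.score(X_te, y_te))
```

In practice the discretizer and imputer would be fitted on the training split only; the sketch keeps the preprocessing global for brevity.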
