DI2: prior-free and multi-item discretization of biomedical data and its applications
arXiv - CS - Symbolic Computation. Pub Date: 2021-03-07, DOI: arxiv-2103.04356
Leonardo Alexandre, Rafael S. Costa, Rui Henriques

Motivation: Many data mining approaches for biomedical data analysis, including state-of-the-art associative models, require some form of data discretization. Although diverse discretization approaches have been proposed, they generally rely on a strict set of statistical assumptions that are arguably insufficient to handle the diversity and heterogeneity of clinical and molecular variables within a given dataset. In addition, although a growing number of symbolic approaches in bioinformatics can assign multiple items to values near discretization boundaries for greater robustness, there are no reference principles on how to perform multi-item discretization.

Results: This study proposes DI2, an unsupervised discretization method for variables with arbitrarily skewed distributions. DI2 provides robust guarantees of generalization by applying data corrections using the Kolmogorov-Smirnov test before statistically fitting distribution candidates. DI2 also supports multi-item assignments. Results gathered from biomedical data show its relevance for improving on classic discretization choices.

Software: available at https://github.com/JupitersMight/DI2
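
To make the idea in the abstract more concrete, here is a minimal Python sketch of distribution-based discretization with multi-item assignment. It is not the DI2 implementation or its API: the function names, the two-member candidate set (normal and exponential), the equal-probability cut points, and the `border` parameter are all assumptions made for illustration. It also leaves out DI2's data-correction step and uses the Kolmogorov-Smirnov statistic only to rank the candidate distributions.

```python
import math
from statistics import NormalDist, fmean


def ks_statistic(ordered, cdf):
    """One-sample KS statistic: largest gap between empirical and model CDF."""
    n = len(ordered)
    gap = 0.0
    for i, x in enumerate(ordered):
        f = cdf(x)
        gap = max(gap, (i + 1) / n - f, f - i / n)
    return gap


def fit_candidates(values):
    """Fit a small, hypothetical candidate set; returns {name: (cdf, inv_cdf)}."""
    normal = NormalDist.from_samples(values)
    candidates = {"normal": (normal.cdf, normal.inv_cdf)}
    mean = fmean(values)
    if min(values) >= 0 and mean > 0:
        rate = 1.0 / mean  # maximum-likelihood rate of an exponential
        candidates["exponential"] = (
            lambda x: 1.0 - math.exp(-rate * x),
            lambda p: -math.log(1.0 - p) / rate,
        )
    return candidates


def discretize(values, n_bins=3, border=0.2):
    """Return (best candidate name, cut points, item tuple per value).

    Assumption: bins are equal-probability intervals of the best-fitting
    candidate, and a value whose position falls within border/2 of a bin
    edge (measured in bin widths on the probability scale) also receives
    the neighbouring item.
    """
    ordered = sorted(values)
    candidates = fit_candidates(values)
    best = min(candidates, key=lambda name: ks_statistic(ordered, candidates[name][0]))
    cdf, inv_cdf = candidates[best]
    cuts = [inv_cdf(k / n_bins) for k in range(1, n_bins)]
    items = []
    for x in values:
        scaled = cdf(x) * n_bins
        primary = min(int(scaled), n_bins - 1)
        offset = scaled - primary  # position inside the bin, from 0 to 1
        assigned = [primary]
        if offset < border / 2 and primary > 0:
            assigned.insert(0, primary - 1)
        elif offset > 1 - border / 2 and primary < n_bins - 1:
            assigned.append(primary + 1)
        items.append(tuple(assigned))
    return best, cuts, items
```

Calling `discretize(values)` on a list of non-negative measurements returns the name of the better-fitting candidate, the cut points, and one tuple of item indices per value; values close to a cut point carry two items. Measuring closeness on the probability scale rather than in raw units is a design choice of this sketch, made so that the border width adapts to skewed variables.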

Updated: 2021-03-09