DI2: prior-free and multi-item discretization of biomedical data and its applications
arXiv - CS - Symbolic Computation. Pub Date: 2021-03-07. DOI: arxiv-2103.04356
Leonardo Alexandre, Rafael S. Costa, Rui Henriques
Motivation: A considerable number of data mining approaches for biomedical
data analysis, including state-of-the-art associative models, require a form of
data discretization. Although diverse discretization approaches have been
proposed, they generally work under a strict set of statistical assumptions
which are arguably insufficient to handle the diversity and heterogeneity of
clinical and molecular variables within a given dataset. In addition, although
an increasing number of symbolic approaches in bioinformatics are able to
assign multiple items to values occurring near discretization boundaries for
superior robustness, there are no reference principles on how to perform
multi-item discretizations.

Results: In this study, an unsupervised discretization method, DI2, for
variables with arbitrarily skewed distributions is proposed. DI2 provides
robust guarantees of generalization by applying data corrections using the
Kolmogorov-Smirnov test before statistically fitting distribution candidates.
DI2 further supports multi-item assignments. Results gathered from biomedical
data show its relevance for improving classic discretization choices.

Software: available at https://github.com/JupitersMight/DI2
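
The abstract names three ingredients: fitting candidate distributions, a Kolmogorov-Smirnov test, and multi-item assignment near discretization boundaries. The following is a minimal Python sketch of how such ingredients could fit together; it is not the DI2 implementation and does not use the DI2 API. All function names, the candidate set (normal, uniform, exponential), the moment-based fitting, and the `border` parameter are assumptions made for illustration. The data-correction step is omitted because the abstract does not describe it; here the Kolmogorov-Smirnov statistic is used only to choose among candidates.

```python
import math
from statistics import NormalDist, mean, pstdev


def ks_statistic(values, cdf):
    """One-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDF of `values` and the model CDF."""
    xs = sorted(values)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def candidates(values):
    """Hypothetical candidate set, returned as (name, cdf, inverse cdf).
    Parameters come from simple moment estimates; assumes non-constant data."""
    mu, sigma = mean(values), pstdev(values)
    lo, hi = min(values), max(values)
    normal = NormalDist(mu, sigma)
    out = [
        ("normal", normal.cdf, normal.inv_cdf),
        ("uniform",
         lambda x: min(1.0, max(0.0, (x - lo) / (hi - lo))),
         lambda p: lo + p * (hi - lo)),
    ]
    if lo >= 0 and mu > 0:  # exponential only makes sense for non-negative data
        out.append(("exponential",
                    lambda x: 1 - math.exp(-x / mu) if x > 0 else 0.0,
                    lambda p: -mu * math.log(1 - p)))
    return out


def discretize(values, n_bins=3, border=0.2):
    """Pick the candidate with the smallest KS statistic, cut it into
    equal-probability bins, and give values near a cut both neighbouring items.
    `border` is an assumed tolerance, expressed as a fraction of one bin's
    probability mass."""
    name, cdf, inv_cdf = min(candidates(values),
                             key=lambda c: ks_statistic(values, c[1]))
    cuts = [inv_cdf(k / n_bins) for k in range(1, n_bins)]
    items = []
    for x in values:
        p = cdf(x)
        assigned = {min(int(p * n_bins), n_bins - 1)}  # primary item
        for k in range(1, n_bins):
            if abs(p - k / n_bins) <= border / n_bins:  # close to cut k
                assigned.update((k - 1, k))
        items.append(sorted(assigned))
    return name, cuts, items
```

Calling `discretize(values, n_bins=2)` returns the name of the selected candidate, the cut points, and one list of item indices per value; a value that falls within the border region of a cut receives two items instead of one.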
Updated: 2021-03-09