Stable bagging feature selection on medical data, Journal of Big Data

Stable bagging feature selection on medical data
Journal of Big Data (IF 8.6) Pub Date: 2021-01-07, DOI: 10.1186/s40537-020-00385-8
Salem Alelyani

In the medical field, identifying the genes that are relevant to a specific disease, such as colon cancer, is crucial to finding a cure and understanding the disease's causes and subsequent complications. Medical datasets usually have a very large number of dimensions and a considerably small sample size. Thus, for domain experts such as biologists, identifying these genes has become a very challenging task. Feature selection is a technique that aims to select these genes, called features in machine learning, with respect to the disease. However, learning from a medical dataset to identify relevant features suffers from the curse of dimensionality. Because of the large number of features and the small sample size, the selection usually returns a different subset each time a new sample is introduced into the dataset. This selection instability is intrinsically related to data variance. We assume that reducing data variance improves selection stability. In this paper, we propose an ensemble approach based on the bagging technique to improve feature selection stability in medical datasets via data variance reduction. We conducted an experiment using four microarray datasets, each of which suffers from high dimensionality and a relatively small sample size. On each dataset, we applied five well-known feature selection algorithms to select varying numbers of features. The proposed technique shows a significant improvement in selection stability while at least maintaining the classification accuracy. The stability improvement ranges from 20 to 50 percent in all cases, which implies that the likelihood of selecting the same features increased by 20 to 50 percent. This is accompanied by an increase in classification accuracy in most cases, which supports the stated stability results.
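The general idea in the abstract, running a feature selector on bootstrap resamples and aggregating the results, can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract names neither the five selectors nor the aggregation rule, so the base selector `top_k_by_score` (a simple class-mean-difference ranking), the vote-count aggregation in `bagged_selection`, and the Jaccard overlap used as a stability proxy are all assumptions chosen for illustration.

```python
import numpy as np


def top_k_by_score(X, y, k):
    """Hypothetical base selector (an assumption, not from the paper):
    rank features by the absolute difference of the two class means,
    scaled by the feature's standard deviation, and keep the top k."""
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)
    sd = X.std(axis=0) + 1e-12  # avoid division by zero
    score = np.abs(mu0 - mu1) / sd
    return np.argsort(-score, kind="stable")[:k]


def bagged_selection(X, y, k, n_bags=25, seed=0):
    """Run the base selector on bootstrap resamples and keep the k
    features chosen most often (vote counting is an assumed rule)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    votes = np.zeros(d, dtype=int)
    for _ in range(n_bags):
        idx = rng.integers(0, n, size=n)  # sample rows with replacement
        if np.unique(y[idx]).size < 2:
            continue  # skip a bag that lost one of the two classes
        votes[top_k_by_score(X[idx], y[idx], k)] += 1
    return np.argsort(-votes, kind="stable")[:k]


def jaccard(a, b):
    """Overlap of two selected subsets; 1.0 means identical subsets."""
    a, b = set(np.asarray(a).tolist()), set(np.asarray(b).tolist())
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Synthetic stand-in for a microarray dataset: few samples, many features.
    rng = np.random.default_rng(42)
    X = rng.normal(size=(40, 500))
    y = np.repeat([0, 1], 20)
    X[y == 1, :5] += 1.5  # make the first five features informative

    # Compare subsets selected on two perturbed copies of the data.
    keep_a = np.delete(np.arange(40), [0, 20])
    keep_b = np.delete(np.arange(40), [1, 21])
    plain = jaccard(top_k_by_score(X[keep_a], y[keep_a], 10),
                    top_k_by_score(X[keep_b], y[keep_b], 10))
    bagged = jaccard(bagged_selection(X[keep_a], y[keep_a], 10),
                     bagged_selection(X[keep_b], y[keep_b], 10))
    print(f"Jaccard overlap, single run: {plain:.2f}")
    print(f"Jaccard overlap, bagged:     {bagged:.2f}")
```

The demo compares how much the selected subsets overlap when two samples are dropped from the data, once for a single selector run and once for the bagged version. Whether bagging raises the overlap on a given dataset depends on the selector, the number of bags, and the data, so the printed numbers should be read as an illustration rather than a reproduction of the reported 20 to 50 percent improvement.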

Updated: 2021-01-07
