Feature selection for imbalanced data with deep sparse autoencoders ensemble, Statistical Analysis and Data Mining


Statistical Analysis and Data Mining (IF 2.1), Pub Date: 2021-12-12, DOI: 10.1002/sam.11567
Michela Carlotta Massi¹,²; Francesca Gasperoni³; Francesca Ieva¹,²; Anna Maria Paganoni¹,²


Class imbalance is a common issue in many domain applications of learning algorithms. In these same domains, it is often far more important to correctly classify and profile minority class observations. This need can be addressed by feature selection (FS), which offers several further advantages, such as decreasing computational costs and aiding inference and interpretability. However, traditional FS techniques may become suboptimal in the presence of strongly imbalanced data. To achieve FS advantages in this setting, we propose a filtering FS algorithm that ranks feature importance on the basis of the reconstruction error of a deep sparse autoencoders ensemble (DSAEE). We use each DSAE, trained only on the majority class, to reconstruct both classes. From the analysis of the aggregated reconstruction error, we determine the features where the minority class presents a different distribution of values with respect to the overrepresented one, thus identifying the most relevant features to discriminate between the two. We empirically demonstrate the efficacy of our algorithm in several experiments, both simulated and on high-dimensional datasets of varying sample size, showcasing its capability to select relevant and generalizable features to profile and classify the minority class, outperforming other benchmark FS methods. We also briefly present a real application in radiogenomics, where the methodology was applied successfully.
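The core idea in the abstract (train autoencoders on the majority class only, reconstruct both classes, and rank features by how much worse the minority class is reconstructed) can be illustrated with a small NumPy sketch. The abstract does not specify the network architecture, the sparsity penalty, how ensemble members differ, or how the errors are aggregated, so everything below is an assumption made for illustration: a tiny tanh autoencoder with an L1 penalty on the bottleneck code, ensemble members trained on bootstrap resamples with different seeds, and a score equal to the mean per-feature gap in squared reconstruction error. The function names `forward`, `train_sparse_autoencoder`, and `dsaee_feature_scores` are hypothetical and do not come from the paper.

```python
import numpy as np

def forward(X, Ws, bs):
    """Return the activations of every layer (tanh hidden layers, linear output)."""
    acts = [X]
    for i, (W, b) in enumerate(zip(Ws, bs)):
        z = acts[-1] @ W + b
        acts.append(z if i == len(Ws) - 1 else np.tanh(z))
    return acts

def train_sparse_autoencoder(X, hidden=(4, 2, 4), l1=1e-3, lr=0.03,
                             epochs=1000, seed=0):
    """Full-batch gradient descent on squared error plus an L1 penalty on the code.
    All hyperparameters are illustrative assumptions, not values from the paper."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    sizes = [d, *hidden, d]
    Ws = [rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, k))
          for m, k in zip(sizes[:-1], sizes[1:])]
    bs = [np.zeros(k) for k in sizes[1:]]
    code = len(hidden) // 2 + 1            # index in `acts` of the bottleneck layer
    for _ in range(epochs):
        acts = forward(X, Ws, bs)
        delta = (acts[-1] - X) / n         # gradient w.r.t. the linear output
        for i in reversed(range(len(Ws))):
            grad_W, grad_b = acts[i].T @ delta, delta.sum(axis=0)
            if i > 0:                      # backpropagate before updating Ws[i]
                grad_a = delta @ Ws[i].T
                if i == code:              # sparsity term acts on the code only
                    grad_a = grad_a + l1 * np.sign(acts[i]) / n
                delta = grad_a * (1.0 - acts[i] ** 2)
            Ws[i] -= lr * grad_W
            bs[i] -= lr * grad_b
    return Ws, bs

def dsaee_feature_scores(X_major, X_minor, n_models=5, seed=0):
    """Average, over the ensemble, of (minority error - majority error) per feature."""
    rng = np.random.default_rng(seed)
    gaps = []
    for m in range(n_models):
        # Assumption: members differ by bootstrap resample and initialisation.
        idx = rng.choice(len(X_major), size=len(X_major), replace=True)
        Ws, bs = train_sparse_autoencoder(X_major[idx], seed=seed + m)
        err_major = ((forward(X_major, Ws, bs)[-1] - X_major) ** 2).mean(axis=0)
        err_minor = ((forward(X_minor, Ws, bs)[-1] - X_minor) ** 2).mean(axis=0)
        gaps.append(err_minor - err_major)
    return np.mean(gaps, axis=0)

def sample(n, rng):
    """Toy data: 10 features, two blocks of 5 noisy copies of a latent factor."""
    z = rng.normal(size=(n, 2))
    return np.repeat(z, 5, axis=1) + 0.5 * rng.normal(size=(n, 10))

data_rng = np.random.default_rng(42)
X_major = sample(300, data_rng)            # overrepresented class
X_minor = sample(15, data_rng)             # minority class
X_minor[:, [0, 5]] += 3.0                  # only features 0 and 5 differ

scores = dsaee_feature_scores(X_major, X_minor)
ranking = np.argsort(scores)[::-1]         # most discriminative feature first
print(ranking)
```

On this toy data, only features 0 and 5 are shifted in the minority class, so they should receive the largest scores. The gap statistic is one simple choice; a distributional comparison of the two classes' errors would be closer to the wording of the abstract but needs details that the abstract does not give.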


Updated: 2021-12-12
