An information theoretic approach to quantify the stability of feature selection and ranking algorithms
Knowledge-Based Systems (IF 7.2), Pub Date: 2020-03-12, DOI: 10.1016/j.knosys.2020.105745
Rocío Alaiz-Rodríguez , Andrew C. Parnell

Feature selection is a key step when dealing with high-dimensional data. In particular, these techniques simplify the process of knowledge discovery from the data by selecting the most relevant features out of the noisy, redundant and irrelevant ones. A problem that arises in many practical applications is that the outcome of the feature selection algorithm is not stable: small variations in the data may yield very different feature rankings. Assessing the stability of these methods therefore becomes an important issue in such applications. We propose an information-theoretic approach based on the Jensen–Shannon divergence to quantify this robustness. Unlike other stability measures, this metric is suitable for different algorithm outcomes: full ranked lists, feature subsets, as well as the lesser-studied partial ranked lists. This generalized metric quantifies the difference among a whole set of lists of the same size, following a probabilistic approach that is able to give more importance to disagreements that appear at the top of the list. Moreover, it possesses desirable properties, including correction for chance, upper and lower bounds, and conditions for a deterministic selection. We illustrate the use of this stability metric with data generated in a fully controlled way, and compare it with popular metrics, including Spearman's rank correlation and Kuncheva's index, on feature ranking and selection outcomes, respectively. Additionally, experimental validation of the proposed approach is carried out on a real-world problem of food quality assessment, showing its potential to quantify stability from different perspectives.
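The core idea of the approach can be sketched in a few lines: each ranked list is mapped to a probability distribution that concentrates mass on top-ranked features, and the spread among those distributions is measured with the generalized Jensen–Shannon divergence, which is bounded above by log(m) for m lists with uniform weights. The weighting scheme below (mass proportional to 1/rank) and the normalization are illustrative assumptions, not the authors' exact formulation:

```python
import numpy as np

def rank_to_distribution(ranking):
    """Map a full ranking (1 = most relevant) over F features to a
    probability vector. Hypothetical weighting: mass proportional to
    1/rank, so disagreements at the top of the list weigh more."""
    r = np.asarray(ranking, dtype=float)
    w = 1.0 / r
    return w / w.sum()

def entropy(p):
    """Shannon entropy in nats, ignoring zero-probability entries."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def js_divergence(distributions):
    """Generalized Jensen-Shannon divergence of m distributions with
    uniform weights: H(mean of P_i) - mean of H(P_i)."""
    P = np.asarray(distributions)
    return entropy(P.mean(axis=0)) - np.mean([entropy(p) for p in P])

def stability(rankings):
    """Illustrative stability score in [0, 1]: 1 when all rankings
    agree (zero divergence), lower as they diverge; the divergence is
    normalized by its log(m) upper bound."""
    dists = [rank_to_distribution(r) for r in rankings]
    return 1.0 - js_divergence(dists) / np.log(len(dists))
```

For example, `stability([[1, 2, 3, 4], [1, 2, 3, 4]])` returns 1.0 for two identical rankings, while reversing one of the lists yields a strictly smaller score.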



