Dealing with heterogeneity in the context of distributed feature selection for classification
Knowledge and Information Systems (IF 2.5), Pub Date: 2020-11-21, DOI: 10.1007/s10115-020-01526-4
José Luis Morillo-Salas, Verónica Bolón-Canedo, Amparo Alonso-Betanzos

Advances in information technologies have greatly contributed to the advent of larger datasets. These datasets often come from distributed sites, but even so, their large size usually means they cannot be handled in a centralized manner. A possible solution to this problem is to distribute the data over several processors and combine the different results. We propose a methodology for distributing feature selection, a process that retains relevant features and discards irrelevant ones. This preprocessing step is essential for current high-dimensional datasets, since it allows the input dimension to be reduced. We pay particular attention to the problem of data imbalance, which occurs either because the original dataset is unbalanced or because the dataset becomes unbalanced after data partitioning. Most works approach unbalanced scenarios by oversampling, while our proposal tests both over- and undersampling strategies. Experimental results demonstrate that our distributed approach to classification achieves accuracy comparable to that of a centralized approach, while reducing computational time and dealing efficiently with data imbalance.
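
The abstract does not name the filter, the partitioning scheme, or the rule used to combine partial results, so the following Python sketch is only an illustration of the general idea and not the authors' method. It assumes horizontal partitioning (by samples), random under- or oversampling inside each partition, a simple mean-difference filter score, and a majority vote across partitions. All function names, parameters (`k`, `top_m`, `min_votes`), and the synthetic data are hypothetical.

```python
import random
from collections import Counter
from statistics import mean, pstdev


def partition(rows, k, rng):
    """Shuffle the rows and deal them into k disjoint partitions."""
    rows = rows[:]
    rng.shuffle(rows)
    return [rows[i::k] for i in range(k)]


def rebalance(rows, strategy, rng):
    """Random under- or oversampling so that all classes have equal size."""
    by_class = {}
    for x, y in rows:
        by_class.setdefault(y, []).append((x, y))
    sizes = [len(group) for group in by_class.values()]
    target = min(sizes) if strategy == "under" else max(sizes)
    out = []
    for group in by_class.values():
        if len(group) >= target:
            out.extend(rng.sample(group, target))  # drop surplus samples
        else:
            # pad the smaller class with randomly repeated samples
            out.extend(group + rng.choices(group, k=target - len(group)))
    return out


def score_features(rows):
    """Assumed filter: |difference of class means| / overall std, per feature."""
    pos = [x for x, y in rows if y == 1]
    neg = [x for x, y in rows if y == 0]
    scores = []
    for j in range(len(rows[0][0])):
        spread = pstdev([x[j] for x, _ in rows]) or 1.0
        gap = abs(mean(x[j] for x in pos) - mean(x[j] for x in neg))
        scores.append(gap / spread)
    return scores


def select_distributed(rows, k=3, top_m=2, min_votes=2, strategy="under", seed=0):
    """Select features per partition, then keep those with enough votes."""
    rng = random.Random(seed)
    votes = Counter()
    for part in partition(rows, k, rng):
        if len({y for _, y in part}) < 2:
            continue  # a partition with a single class cannot be scored
        balanced = rebalance(part, strategy, rng)
        scores = score_features(balanced)
        ranked = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
        votes.update(ranked[:top_m])
    return sorted(j for j, v in votes.items() if v >= min_votes)


def make_data(n=600, seed=1):
    """Synthetic data: about 10% positives; features 0 and 1 carry the signal."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        y = 1 if rng.random() < 0.1 else 0
        x = [rng.gauss(4.0 * y, 1.0), rng.gauss(-4.0 * y, 1.0)]
        x += [rng.gauss(0.0, 1.0) for _ in range(3)]  # three noise features
        rows.append((x, y))
    return rows


if __name__ == "__main__":
    data = make_data()
    for strategy in ("under", "over"):
        print(strategy, select_distributed(data, strategy=strategy))
```

In this sketch each partition could be processed on a separate worker, since partitions share no state until the final vote. Swapping `strategy` between `"under"` and `"over"` mirrors the two resampling options the abstract mentions; a real experiment would replace the toy filter and vote with the methods described in the paper itself.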




Updated: 2020-11-22