The impact of data difficulty factors on classification of imbalanced and concept drifting data streams
Knowledge and Information Systems (IF 2.7), Pub Date: 2021-04-01, DOI: 10.1007/s10115-021-01560-w
Dariusz Brzezinski, Leandro L. Minku, Tomasz Pewinski, Jerzy Stefanowski, Artur Szumaczuk

Class imbalance introduces additional challenges when learning classifiers from concept drifting data streams. Most existing work focuses on designing new algorithms for dealing with the global imbalance ratio and does not consider other data complexities. Independent research on static imbalanced data has highlighted the influential role of local data difficulty factors such as minority class decomposition and the presence of unsafe types of examples. Although such factors are often present in real-world data, the interactions between concept drifts and local data difficulty factors have not yet been investigated in concept drifting data streams. We thoroughly study the impact of such interactions on drifting imbalanced streams. For this purpose, we put forward a new categorization of concept drifts for class imbalanced problems. Through comprehensive experiments with synthetic and real data streams, we study the influence of concept drifts, global class imbalance, local data difficulty factors, and their combinations, on predictions of representative online classifiers. Experimental results reveal the strong influence of the newly considered factors and their local drifts, as well as differences in existing classifiers’ reactions to such factors. Combinations of multiple factors are the most challenging for classifiers. Although existing classifiers are partially capable of coping with global class imbalance, new approaches are needed to address challenges posed by imbalanced data streams.
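
To make the distinction between the global imbalance ratio and local data difficulty factors concrete, the following is a minimal sketch, not the authors' code. It assumes one common convention from the static imbalanced-learning literature: each minority example is typed by how many of its k = 5 nearest neighbours share its class (4 to 5: safe, 2 to 3: borderline, 1: rare, 0: outlier). The abstract does not state these thresholds, and the function names `minority_share`, `example_type`, and `label_minority_examples` are hypothetical.

```python
import math
from collections import Counter


def minority_share(labels, minority):
    """Global view: fraction of all examples that belong to the minority class."""
    return Counter(labels)[minority] / len(labels)


def example_type(same_class_neighbours):
    """Local view: map a same-class neighbour count to an example type.

    The thresholds are an assumption and only make sense for k = 5.
    """
    if same_class_neighbours >= 4:
        return "safe"
    if same_class_neighbours >= 2:
        return "borderline"
    if same_class_neighbours == 1:
        return "rare"
    return "outlier"


def label_minority_examples(points, labels, minority, k=5):
    """Return (index, type) for every minority example, using brute-force k-NN."""
    typed = []
    for i, (p, y) in enumerate(zip(points, labels)):
        if y != minority:
            continue
        # Euclidean distance to every other example, nearest first.
        others = sorted(
            (math.dist(p, q), labels[j])
            for j, q in enumerate(points)
            if j != i
        )
        same = sum(1 for _, lab in others[:k] if lab == y)
        typed.append((i, example_type(same)))
    return typed
```

Two streams with the same minority share can differ sharply in the proportion of unsafe (borderline, rare, outlier) minority examples, which is the kind of local factor the abstract argues is overlooked. In a streaming setting the same labelling would be applied over a sliding window rather than the whole data set; that windowing is likewise an assumption here.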




Updated: 2021-04-01