Smart Data based Ensemble for Imbalanced Big Data Classification
arXiv - CS - Machine Learning. Pub Date: 2020-01-16, DOI: arxiv-2001.05759
Diego García-Gil, Johan Holmberg, Salvador García, Ning Xiong, Francisco Herrera

Big Data scenarios pose a new challenge to traditional data mining algorithms, which are not prepared to work with such amounts of data. Smart Data refers to data of sufficient quality to improve the outcome of a data mining algorithm. The inability of existing data mining algorithms to handle Big Datasets prevents the transition from Big Data to Smart Data. The automated data acquisition that characterizes Big Data also brings problems of its own, such as differences in the amount of data per class, which lead classifiers to lean towards the most represented classes. This problem is known as imbalanced data distribution, where one class is underrepresented in the dataset. Ensembles of classifiers are machine learning methods that improve on the performance of a single base classifier by combining several of them. Ensembles are not exempt from the imbalanced classification problem; to deal with it, the ensemble method has to be designed specifically for imbalanced data. In this paper, a data preprocessing ensemble for imbalanced Big Data classification is presented, with a focus on two-class problems. Experiments carried out on 21 Big Datasets show that our ensemble classifier outperforms classic machine learning models, such as Random Forests, combined with an added data balancing method.
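
The general idea of pairing data balancing with an ensemble on a two-class problem can be illustrated with a minimal sketch. This is not the method proposed in the paper, whose preprocessing steps the abstract does not describe. As stated assumptions, the sketch uses random undersampling as the balancing step, decision stumps as base classifiers, and a majority vote to combine them; all function names and the toy data are hypothetical.

```python
import random
from collections import Counter


def undersample(rows, labels, rng):
    """Randomly drop rows until every class is as small as the rarest one."""
    by_class = {}
    for x, y in zip(rows, labels):
        by_class.setdefault(y, []).append(x)
    n = min(len(xs) for xs in by_class.values())
    sample = [(x, y) for y, xs in by_class.items() for x in rng.sample(xs, n)]
    rng.shuffle(sample)
    return sample


def fit_stump(sample):
    """Fit a one-feature threshold classifier by exhaustive search."""
    best = None
    for f in range(len(sample[0][0])):
        for x, _ in sample:
            t = x[f]
            for hi in (0, 1):  # label predicted when the feature exceeds t
                errors = sum(
                    (hi if xi[f] > t else 1 - hi) != yi for xi, yi in sample
                )
                if best is None or errors < best[0]:
                    best = (errors, f, t, hi)
    _, f, t, hi = best
    return lambda x: hi if x[f] > t else 1 - hi


def fit_ensemble(rows, labels, n_members=11, seed=0):
    """Train each member on its own balanced resample of the data."""
    rng = random.Random(seed)
    return [fit_stump(undersample(rows, labels, rng)) for _ in range(n_members)]


def predict(ensemble, x):
    """Combine the members by majority vote."""
    votes = Counter(member(x) for member in ensemble)
    return votes.most_common(1)[0][0]


# Toy imbalanced data (hypothetical): 200 majority rows, 10 minority rows.
# The first feature separates the classes; the second is noise.
rng = random.Random(42)
majority = [[rng.uniform(0.0, 1.0), rng.random()] for _ in range(200)]
minority = [[rng.uniform(2.0, 3.0), rng.random()] for _ in range(10)]
rows = majority + minority
labels = [0] * 200 + [1] * 10

ensemble = fit_ensemble(rows, labels)
print(predict(ensemble, [2.5, 0.5]), predict(ensemble, [-1.0, 0.5]))
```

Because each member sees a different balanced subset, the ensemble as a whole uses more of the majority class than any single undersampled classifier would, while no member is trained on the skewed class ratio.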

Updated: 2020-01-17