Multi-class imbalanced big data classification on Spark, Knowledge-Based Systems

Multi-class imbalanced big data classification on Spark
Knowledge-Based Systems (IF 7.2), Pub Date: 2020-11-07, DOI: 10.1016/j.knosys.2020.106598
William C. Sleeman IV, Bartosz Krawczyk

Despite more than two decades of progress, learning from imbalanced data is still considered one of the contemporary challenges in machine learning. The advent of the big data era has complicated this further: popular algorithms dedicated to alleviating the impact of class skew are no longer feasible given the volume of the datasets. Additionally, most existing algorithms focus on binary imbalanced problems, where the majority and minority classes are well defined. Multi-class imbalanced data poses further challenges, because the relationships between classes are much more complex and simple decomposition into a number of binary problems leads to a significant loss of information. In this paper, we propose the first compound framework for dealing with multi-class big data problems, addressing both the existence of multiple classes and high volumes of data. We propose analyzing the instance-level difficulties in each class in order to understand what causes learning difficulties. We embed this information in popular resampling algorithms, which allows for informative balancing of multiple classes. We propose an efficient implementation of the discussed algorithm on Apache Spark, including a novel version of SMOTE that overcomes the spatial limitations of its predecessor in distributed environments. An extensive experimental study shows that using instance-level information significantly improves learning from multi-class imbalanced big data. Our framework can be downloaded from https://github.com/fsleeman/minority-type-imbalanced.
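
The abstract does not say how instance-level difficulty is measured or how it is fed into resampling. As a rough illustration only, the sketch below assumes a k-nearest-neighbor taxonomy that is common in the imbalanced-learning literature (safe, borderline, rare, outlier, with k = 5) and uses it to choose the seed instances for a SMOTE-style interpolation. It is single-machine plain Python, not the authors' Spark implementation; the thresholds, the toy data, and the function names `k_nearest`, `instance_type`, `label_instances` and `smote_point` are all assumptions made for this example.

```python
import math
import random


def k_nearest(points, i, k):
    """Indices of the k points closest to points[i] (brute force)."""
    dists = sorted((math.dist(points[i], p), j)
                   for j, p in enumerate(points) if j != i)
    return [j for _, j in dists[:k]]


def instance_type(same_class_neighbors):
    """Map the same-class neighbor count (out of k = 5) to a difficulty type.

    The thresholds are an assumption for this sketch, not taken from the paper.
    """
    if same_class_neighbors >= 4:
        return "safe"
    if same_class_neighbors >= 2:
        return "borderline"
    if same_class_neighbors == 1:
        return "rare"
    return "outlier"


def label_instances(points, labels, k=5):
    """Assign a difficulty type to every instance, whatever its class."""
    types = []
    for i in range(len(points)):
        same = sum(1 for j in k_nearest(points, i, k) if labels[j] == labels[i])
        types.append(instance_type(same))
    return types


def smote_point(points, labels, types, target_class, allowed_types, k=5, rng=None):
    """Create one synthetic point for target_class, SMOTE style.

    A seed is drawn only from instances whose type is in allowed_types, then
    interpolated toward one of its nearest same-class neighbors.
    """
    rng = rng or random.Random(0)
    members = [i for i, y in enumerate(labels) if y == target_class]
    seeds = [i for i in members if types[i] in allowed_types]
    if not seeds or len(members) < 2:
        raise ValueError("not enough instances to interpolate")
    seed = rng.choice(seeds)
    mates = sorted((j for j in members if j != seed),
                   key=lambda j: math.dist(points[seed], points[j]))[:k]
    mate = rng.choice(mates)
    gap = rng.random()  # position along the segment from seed to mate
    return tuple(a + gap * (b - a) for a, b in zip(points[seed], points[mate]))


# Toy data: a dense class 0, a small distant class 1, and a lone class 2
# instance sitting inside the class 0 cluster.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (0.5, 0.0),
          (10, 10), (10, 11), (11, 10),
          (0.4, 0.6)]
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]

types = label_instances(points, labels)
synthetic = smote_point(points, labels, types, target_class=1,
                        allowed_types={"borderline"})
```

On this toy data the lone class 2 instance is typed as an outlier, while the three class 1 instances come out as borderline simply because the class is smaller than k. That is one reason a per-class, instance-level view can be more informative than class counts alone.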

Updated: 2020-12-01
