Imbalance-XGBoost: leveraging weighted and focal losses for binary label-imbalanced classification with XGBoost
Pattern Recognition Letters (IF 5.1), Pub Date: 2020-06-02, DOI: 10.1016/j.patrec.2020.05.035
Chen Wang, Chengyuan Deng, Suzhen Wang

This paper presents Imbalance-XGBoost, a Python package that combines XGBoost with weighted and focal losses to tackle binary label-imbalanced classification tasks. Though small in size, the package is, to the best of our knowledge, the first to provide an integrated implementation of the two loss functions on XGBoost, and it brings a general-purpose extension to XGBoost for label-imbalanced scenarios. The design and usage of the package are discussed and illustrated with examples. Furthermore, because the first- and second-order derivatives of the loss functions are essential to the implementation, their algebraic derivation is discussed and can be regarded as a separate contribution. The methods implemented in the package are extensively evaluated on a Parkinson's disease classification dataset, and competitive results are presented with ROC and Precision-Recall (PR) curves. To further support the advantage of the methods, results on four other benchmark datasets from the UCI machine learning repository are also reported. Given the scalable nature of XGBoost, the package has great potential to be broadly applied to real-life binary classification tasks, which are usually large-scale and label-imbalanced.
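
The role of the first- and second-order derivatives can be illustrated with a minimal NumPy sketch. It assumes the commonly used forms of the two losses on a raw score z with p = sigmoid(z): a weighted cross-entropy with positive-class weight `alpha`, and a focal loss with focusing parameter `gamma`. The function names and parameter names below are hypothetical and are not the package's API; the abstract does not state the exact loss definitions, so these forms are an assumption.

```python
import numpy as np


def sigmoid(z):
    """Logistic function mapping raw scores to probabilities."""
    return 1.0 / (1.0 + np.exp(-z))


def weighted_loss(z, y, alpha):
    """Assumed weighted cross-entropy: positives are scaled by alpha."""
    p = sigmoid(z)
    return -(alpha * y * np.log(p) + (1.0 - y) * np.log(1.0 - p))


def weighted_grad_hess(z, y, alpha):
    """First and second derivatives of weighted_loss with respect to z."""
    p = sigmoid(z)
    w = np.where(y == 1, alpha, 1.0)  # per-sample weight
    grad = w * (p - y)
    hess = w * p * (1.0 - p)
    return grad, hess


def focal_loss(z, y, gamma):
    """Assumed focal loss: -(1 - pt)^gamma * log(pt), pt = prob. of true class."""
    p = sigmoid(z)
    pt = np.where(y == 1, p, 1.0 - p)
    return -((1.0 - pt) ** gamma) * np.log(pt)


def focal_grad_hess(z, y, gamma):
    """First and second derivatives of focal_loss with respect to z."""
    p = sigmoid(z)
    pt = np.where(y == 1, p, 1.0 - p)
    s = np.where(y == 1, 1.0, -1.0)  # d(pt)/dz = s * pt * (1 - pt)
    log_pt = np.log(pt)
    one_m = 1.0 - pt
    grad = s * (gamma * pt * one_m ** gamma * log_pt - one_m ** (gamma + 1))
    hess = (
        gamma * pt * one_m ** gamma * log_pt * (one_m - gamma * pt)
        + (2.0 * gamma + 1.0) * pt * one_m ** (gamma + 1)
    )
    return grad, hess


if __name__ == "__main__":
    # Toy scores and labels, chosen only for illustration.
    z = np.array([-2.0, -0.5, 0.3, 1.7])
    y = np.array([1.0, 0.0, 1.0, 0.0])
    print(weighted_grad_hess(z, y, alpha=2.0))
    print(focal_grad_hess(z, y, gamma=2.0))
```

A gradient and Hessian pair of this shape is what a custom objective passed to XGBoost is expected to return per sample. With `gamma = 0` and `alpha = 1`, both pairs reduce to the ordinary logistic-loss derivatives, which gives a quick sanity check on the algebra.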




Updated: 2020-06-23