Optimal Decision Trees for Nonlinear Metrics
arXiv - CS - Data Structures and Algorithms, Pub Date: 2020-09-15, DOI: arxiv-2009.06921
Emir Demirović, Peter J. Stuckey

Nonlinear metrics, such as the F1-score, Matthews correlation coefficient, and Fowlkes-Mallows index, are often used to evaluate the performance of machine learning models, particularly on imbalanced datasets that contain more samples of one class than the other. Recent optimal decision tree algorithms have shown remarkable progress in producing trees that are optimal with respect to linear criteria, such as accuracy, but nonlinear metrics remain a challenge. To address this gap, we propose a novel algorithm based on bi-objective optimisation, which treats the misclassifications of each of the two classes as a separate objective. We show that, for a large class of metrics, the optimal tree lies on the Pareto frontier. Consequently, we obtain the optimal tree by using our method to generate the set of all nondominated trees and then selecting the best of them under the given metric. To the best of our knowledge, this is the first method to compute provably optimal decision trees for nonlinear metrics. Our approach involves a trade-off compared to optimising linear metrics: the resulting trees may score better on the given nonlinear metric, at the expense of higher runtimes. Nevertheless, the experiments show that runtimes are reasonable for the majority of the tested datasets.
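
The Pareto-frontier argument can be illustrated with a small Python sketch. This is not the authors' algorithm: it assumes a set of candidate trees has already been evaluated and summarised as (false positives, false negatives) pairs, and it filters that set by brute force. The names `pareto_front`, `f1_score`, `best_by_metric`, and the sample data are hypothetical. The sketch relies on the fact that the F1-score never improves when either error count grows, so the F1-optimal candidate must be nondominated.

```python
from typing import Callable, Iterable, List, Tuple

# One candidate tree, summarised as (false positives, false negatives).
Point = Tuple[int, int]


def pareto_front(points: Iterable[Point]) -> List[Point]:
    """Return the nondominated (fp, fn) pairs, sorted by fp."""
    front: List[Point] = []
    best_fn = None
    # After sorting by (fp, fn), a point is nondominated exactly when its
    # fn is strictly smaller than every fn seen so far.
    for fp, fn in sorted(set(points)):
        if best_fn is None or fn < best_fn:
            front.append((fp, fn))
            best_fn = fn
    return front


def f1_score(fp: int, fn: int, n_pos: int) -> float:
    """F1-score from error counts, given the number of positive samples."""
    tp = n_pos - fn
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_by_metric(points: Iterable[Point],
                   metric: Callable[[Point], float]) -> Point:
    """Pick the frontier point that maximises the given metric."""
    return max(pareto_front(points), key=metric)


# Hypothetical data: 10 positive samples, six candidate trees.
n_pos = 10
candidates = [(0, 8), (1, 5), (3, 5), (5, 2), (6, 4), (20, 0)]

best_f1 = best_by_metric(candidates, lambda p: f1_score(p[0], p[1], n_pos))
best_acc = best_by_metric(candidates, lambda p: -(p[0] + p[1]))

print(pareto_front(candidates))  # [(0, 8), (1, 5), (5, 2), (20, 0)]
print(best_f1)                   # (5, 2): 7 errors in total, highest F1
print(best_acc)                  # (1, 5): 6 errors in total, highest accuracy
```

In this toy data the accuracy-optimal and F1-optimal candidates differ, which is the situation the abstract describes: optimising a linear criterion alone can miss the tree that is best under a nonlinear metric.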

Updated: 2020-09-16