A branch & bound algorithm to determine optimal cross-splits for decision tree induction
Annals of Mathematics and Artificial Intelligence ( IF 1.2 ) Pub Date : 2020-01-03 , DOI: 10.1007/s10472-019-09684-0
Ferdinand Bollwein , Martin Dahmen , Stephan Westphal

State-of-the-art decision tree algorithms are top-down induction heuristics that greedily partition the attribute space by iteratively choosing the best split on an individual attribute. Despite their attractive runtime, simple examples such as the XOR problem show that these heuristics often fail to find the best classification rules when there are strong interactions between two or more attributes in the data. In this context, we present a branch-and-bound-based decision tree algorithm that identifies optimal bivariate axis-aligned splits according to a given impurity measure. In contrast to a univariate split, which can be found in linear time, such an optimal cross-split has to consider every combination of values for every possible pair of attributes, which leads to a combinatorial optimization problem that is quadratic in the number of values and attributes. To overcome this complexity, we use a branch-and-bound procedure, a well-known technique from combinatorial optimization, to divide the solution space into several sets and to detect the optimal cross-splits in a short amount of time. These cross-splits can either be used directly to construct quaternary decision trees, or they can be used to select only the better of the two individual splits. In the latter case, the outcome is a binary decision tree with a certain sense of foresight for correlated attributes. We test both of these variants on various datasets of the UCI Machine Learning Repository and show that cross-splits consistently produce smaller decision trees than state-of-the-art methods with comparable accuracy. In some cases, our algorithm produces considerably more accurate trees owing to its ability to draw more elaborate decisions than univariate induction algorithms.
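To make the search space concrete, the following is a minimal sketch of the quadratic problem the paper's branch-and-bound procedure is designed to avoid: a naive exhaustive search over every attribute pair and every pair of thresholds, scoring each candidate cross-split by the weighted Gini impurity of the induced quaternary partition. All function names are illustrative and this brute force does not implement the authors' pruning; it only illustrates the objective.

```python
from itertools import combinations

def gini(labels):
    # Gini impurity of a multiset of class labels.
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def cross_split_impurity(X, y, i, j, ti, tj):
    # Weighted Gini of the four quadrants induced by threshold ti
    # on attribute i and threshold tj on attribute j.
    quads = [[], [], [], []]
    for row, label in zip(X, y):
        q = (row[i] <= ti) * 2 + (row[j] <= tj)
        quads[q].append(label)
    n = len(y)
    return sum(len(part) / n * gini(part) for part in quads)

def best_cross_split(X, y):
    """Exhaustively search every attribute pair and threshold pair.

    This is the quadratic enumeration the branch-and-bound procedure
    prunes; it returns the minimum impurity and the split achieving it.
    """
    d = len(X[0])
    best_imp, best_split = float("inf"), None
    for i, j in combinations(range(d), 2):
        ti_cands = sorted(set(row[i] for row in X))
        tj_cands = sorted(set(row[j] for row in X))
        for ti in ti_cands:
            for tj in tj_cands:
                imp = cross_split_impurity(X, y, i, j, ti, tj)
                if imp < best_imp:
                    best_imp, best_split = imp, (i, j, ti, tj)
    return best_imp, best_split
```

On the XOR example from the abstract (points (0,0) and (1,1) in one class, (0,1) and (1,0) in the other), no single univariate split reduces the impurity below 0.5, while the cross-split at thresholds 0 on both attributes yields four pure quadrants with impurity 0.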

Updated: 2020-01-03