On Distributed Fuzzy Decision Trees for Big Data, IEEE Transactions on Fuzzy Systems

On Distributed Fuzzy Decision Trees for Big Data
IEEE Transactions on Fuzzy Systems (IF 11.9), Pub Date: 2018-02-01, DOI: 10.1109/tfuzz.2016.2646746
Armando Segatori, Francesco Marcelloni, Witold Pedrycz

Fuzzy decision trees (FDTs) have been shown to be an effective solution in the framework of fuzzy classification. The approaches proposed so far for FDT learning, however, have generally neglected time and space requirements. In this paper, we propose a distributed FDT learning scheme shaped according to the MapReduce programming model for generating both binary and multiway FDTs from big data. The scheme relies on a novel distributed fuzzy discretizer that generates a strong fuzzy partition for each continuous attribute based on fuzzy information entropy. The fuzzy partitions are then used as input to the FDT learning algorithm, which employs fuzzy information gain for selecting the attributes at the decision nodes. We have implemented the FDT learning scheme on the Apache Spark framework. We have used ten real-world publicly available big datasets for evaluating the behavior of the scheme along three dimensions: 1) performance in terms of classification accuracy, model complexity, and execution time; 2) scalability as the number of computing units varies; and 3) ability to efficiently accommodate an increasing dataset size. We have demonstrated that the proposed scheme is suitable for managing big datasets even with modest commodity hardware. Finally, we have used the distributed decision tree learning algorithm implemented in the MLlib library and the Chi-FRBCS-BigData algorithm, a MapReduce distributed fuzzy rule-based classification system, for comparative analysis.
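
To make the terms "strong fuzzy partition", "fuzzy information entropy", and "fuzzy information gain" concrete, the following is a minimal single-machine sketch in Python. It is not the authors' distributed implementation. It assumes triangular membership functions, the product T-norm for combining memberships, and the usual textbook definitions of fuzzy entropy and gain; the function names (`triangular_partition`, `fuzzy_entropy`, `fuzzy_information_gain`) are hypothetical and chosen only for illustration.

```python
from math import log2


def triangular_partition(cores):
    """Build a strong triangular fuzzy partition from sorted core points.

    Assumption: for any x, the memberships of all fuzzy sets sum to 1,
    which is the defining property of a strong partition.
    """
    last = len(cores) - 1

    def make(i):
        def mu(x):
            c = cores[i]
            if x == c:
                return 1.0
            if x < c:
                if i == 0:
                    return 1.0  # values left of the first core belong to it fully
                lo = cores[i - 1]
                return (x - lo) / (c - lo) if x > lo else 0.0
            if i == last:
                return 1.0  # values right of the last core belong to it fully
            hi = cores[i + 1]
            return (hi - x) / (hi - c) if x < hi else 0.0
        return mu

    return [make(i) for i in range(len(cores))]


def fuzzy_entropy(weights, labels):
    """Entropy where each example counts with its membership degree."""
    total = sum(weights)  # fuzzy cardinality of the set
    if total == 0:
        return 0.0
    mass = {}
    for w, y in zip(weights, labels):
        mass[y] = mass.get(y, 0.0) + w
    return -sum((m / total) * log2(m / total) for m in mass.values() if m > 0)


def fuzzy_information_gain(values, labels, partition, weights=None):
    """Gain of splitting one continuous attribute on the given fuzzy partition."""
    if weights is None:
        weights = [1.0] * len(values)  # root node: every example fully belongs
    parent = fuzzy_entropy(weights, labels)
    # Assumption: child membership = parent membership * mu(x) (product T-norm).
    children = [[w * mu(x) for w, x in zip(weights, values)] for mu in partition]
    cards = [sum(child) for child in children]
    total = sum(cards)
    if total == 0:
        return 0.0
    weighted = sum(
        card / total * fuzzy_entropy(child, labels)
        for child, card in zip(children, cards)
    )
    return parent - weighted


if __name__ == "__main__":
    part = triangular_partition([0.0, 0.5, 1.0])
    print([mu(0.3) for mu in part])
    print(fuzzy_information_gain([0.0, 0.1, 0.9, 1.0], ["a", "a", "b", "b"], part))
```

In a MapReduce or Spark setting, the per-class membership sums accumulated in `fuzzy_entropy` are the kind of quantity that can be computed per data partition and then added together, but how the paper actually distributes this work is not described in the abstract.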

Updated: 2018-02-01
