Building efficient fuzzy regression trees for large scale and high dimensional problems, Journal of Big Data



Journal of Big Data (IF 8.1), Pub Date: 2018-12-12, DOI: 10.1186/s40537-018-0159-y
Javier Cózar, Francesco Marcelloni, José A. Gámez, Luis de la Ossa

Regression trees (RTs) are simple but powerful models that have been widely used over the last decades in many domains. Fuzzy RTs (FRTs) add fuzziness to RTs in order to deal with uncertain environments. Most FRT learning approaches proposed in the literature aim to improve accuracy, measured in terms of mean squared error, and often neglect computation time and/or memory requirements. In today’s application domains, which require the management of huge amounts of data, this omission can strongly limit their use. In this paper, we propose a distributed FRT (DFRT) learning scheme, based on the MapReduce paradigm, for generating binary RTs from big datasets. We have designed and implemented the scheme on the Apache Spark framework. We have used eight real-world and four synthetic datasets to evaluate its performance in terms of mean squared error, computation time, and scalability. As a baseline, we have compared the results with the distributed RT (DRT) and the Distributed Random Forest (DRF) available in the Spark MLlib library. Results show that our DFRT scales similarly to DRT and better than DRF. Regarding predictive performance, DFRT generalizes much better than DRT and similarly to DRF.
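To make the idea of scoring a fuzzy binary split in a MapReduce style more concrete, here is a minimal sketch in plain Python. It is not the authors' DFRT algorithm and does not use Spark: the linear membership function, the function names (`membership_left`, `map_partition`, `reduce_stats`, `split_mse`) and the `low`/`high` parameters are all assumptions made for illustration. Each partition emits membership-weighted sufficient statistics for the two children (map), the statistics are summed across partitions (reduce), and the weighted mean squared error of the split is computed from the totals.

```python
from functools import reduce


def membership_left(x, low, high):
    """Hypothetical linear membership of x in the 'left' fuzzy set.

    Fully left below `low`, fully right above `high`, linear in between.
    """
    if x <= low:
        return 1.0
    if x >= high:
        return 0.0
    return (high - x) / (high - low)


def map_partition(rows, low, high):
    """Map step: weighted statistics (sum w, sum w*y, sum w*y^2) per child."""
    stats = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # [left child, right child]
    for x, y in rows:
        mu = membership_left(x, low, high)
        for child, w in enumerate((mu, 1.0 - mu)):
            stats[child][0] += w
            stats[child][1] += w * y
            stats[child][2] += w * y * y
    return stats


def reduce_stats(a, b):
    """Reduce step: element-wise sum of two per-partition statistics."""
    return [[p + q for p, q in zip(ca, cb)] for ca, cb in zip(a, b)]


def weighted_sse(stat):
    """Weighted sum of squared errors around the child's weighted mean."""
    w, s, ss = stat
    return ss - s * s / w if w > 0 else 0.0


def split_mse(partitions, low, high):
    """Weighted MSE of the fuzzy split, aggregated over all partitions."""
    total = reduce(reduce_stats, (map_partition(p, low, high) for p in partitions))
    weight = total[0][0] + total[1][0]
    return (weighted_sse(total[0]) + weighted_sse(total[1])) / weight


# Two partitions of (feature, target) pairs, as if held by two workers.
partitions = [[(0.0, 1.0), (1.0, 1.0)], [(9.0, 5.0), (10.0, 5.0)]]
good = split_mse(partitions, low=4.0, high=6.0)    # separates the two groups
bad = split_mse(partitions, low=-2.0, high=-1.0)   # sends everything right
```

Because only three numbers per child are exchanged between the map and reduce steps, the cost of combining results does not depend on the number of instances in each partition, which is the property that makes this kind of aggregation attractive for large datasets.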


Updated: 2018-12-12
