Label-Aware Distributed Ensemble Learning: A Simplified Distributed Classifier Training Model for Big Data, Big Data Research



Big Data Research (IF 3.5), Pub Date: 2018-11-16, DOI: 10.1016/j.bdr.2018.11.001
Shadi Khalifa, Patrick Martin, Rebecca Young

Label-Aware Distributed Ensemble Learning (LADEL) is a programming model and an associated implementation for distributing any classifier training to handle Big Data. It only requires users to specify the training data source, the classification algorithm and the desired parallelization level. First, a distributed stratified sampling algorithm is proposed to generate stratified samples from large, pre-partitioned datasets in a shared-nothing architecture. It executes in a single pass over the data and minimizes inter-machine communication. Second, the specified classification algorithm training is parallelized and executed on any number of heterogeneous machines. Finally, the trained classifiers are aggregated to produce the final classifier.
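The three steps above (stratified sampling, parallel training, aggregation) can be illustrated with a minimal single-process sketch. It is not the paper's implementation: the abstract does not spell out the sampling or aggregation rules, so the round-robin assignment per label, the majority vote, and every name here (`stratified_split`, `train_ensemble`, `predict_vote`, `NearestCentroid`) are assumptions made for illustration, and the nearest-centroid learner merely stands in for any single-node classifier.

```python
from collections import Counter, defaultdict


def stratified_split(partition, k):
    """One pass over a single partition: deal the rows of each label
    round-robin into k samples, so every sample keeps roughly the
    partition's label proportions. Uses only partition-local state."""
    samples = [[] for _ in range(k)]
    seen = defaultdict(int)  # rows seen so far, per label
    for features, label in partition:
        samples[seen[label] % k].append((features, label))
        seen[label] += 1
    return samples


class NearestCentroid:
    """Toy stand-in for an arbitrary single-node classifier."""

    def fit(self, rows):
        sums, counts = {}, Counter()
        for features, label in rows:
            acc = sums.setdefault(label, [0.0] * len(features))
            for i, value in enumerate(features):
                acc[i] += value
            counts[label] += 1
        self.centroids = {
            label: [s / counts[label] for s in acc] for label, acc in sums.items()
        }
        return self

    def predict(self, features):
        def squared_distance(label):
            return sum((a - b) ** 2 for a, b in zip(features, self.centroids[label]))

        return min(self.centroids, key=squared_distance)


def train_ensemble(partitions, k, make_classifier):
    """Assumed scheme: sample j of every partition is routed to trainer j,
    and each trainer fits its own classifier independently."""
    merged = [[] for _ in range(k)]
    for partition in partitions:  # in a cluster, one machine per partition
        for slot, rows in enumerate(stratified_split(partition, k)):
            merged[slot].extend(rows)
    return [make_classifier().fit(rows) for rows in merged]


def predict_vote(classifiers, features):
    """Aggregate the ensemble by majority vote (an assumed aggregation rule)."""
    votes = Counter(clf.predict(features) for clf in classifiers)
    return votes.most_common(1)[0][0]


# Two pre-partitioned chunks of a tiny two-class dataset (made-up values).
partitions = [
    [((0.0, 0.1), "a"), ((0.2, 0.0), "a"), ((5.0, 5.1), "b"), ((5.2, 4.9), "b")],
    [((0.1, 0.3), "a"), ((4.8, 5.0), "b"), ((0.3, 0.2), "a"), ((5.1, 5.3), "b")],
]
ensemble = train_ensemble(partitions, k=2, make_classifier=NearestCentroid)
```

Calling `predict_vote(ensemble, (0.1, 0.1))` then returns the label chosen by most of the trained classifiers. Because the per-label counters live inside each partition, the split itself needs no coordination between machines; only the resulting samples would have to be routed to trainers.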

Data miners can use LADEL to run any classification algorithm on any distributed framework, without any experience in parallel and distributed systems. The proposed LADEL model can be implemented on any distributed framework (Drill, Spark, Hadoop, etc.) to speed up the development of its data mining capabilities. It is also generic and can be used to distribute the training of any classification algorithm of any sequential single-node data mining library (Weka, R, scikit-learn, etc.). Distributed frameworks can implement LADEL to distribute the execution of existing data mining libraries without rewriting the algorithms to run in parallel.
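To make the genericity claim concrete, here is a hedged sketch of what a library-agnostic training job could look like when the user supplies only the data, the algorithm and the parallelization level. The `TrainingJob` and `run` names, the `fit`/`predict` interface, and the use of local threads in place of cluster machines are all assumptions for illustration; they are not LADEL's actual API, nor that of Drill, Spark, Weka or scikit-learn.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Protocol, Sequence, Tuple

Row = Tuple[Tuple[float, ...], str]  # (feature vector, class label)


class Classifier(Protocol):
    """Minimal interface an existing sequential learner would be adapted to."""

    def fit(self, rows: Sequence[Row]) -> "Classifier": ...

    def predict(self, features: Tuple[float, ...]) -> str: ...


@dataclass
class TrainingJob:
    samples: Sequence[Sequence[Row]]     # training data, already split into samples
    algorithm: Callable[[], Classifier]  # factory for any single-node learner
    parallelism: int                     # desired parallelization level


def run(job: TrainingJob) -> list:
    """Fit one unmodified classifier per sample; threads stand in for machines."""
    with ThreadPoolExecutor(max_workers=job.parallelism) as pool:
        # pool.map returns results in the same order as job.samples
        return list(pool.map(lambda rows: job.algorithm().fit(rows), job.samples))


class MostFrequentLabel:
    """Trivial learner used only to show that the algorithm is pluggable."""

    def fit(self, rows):
        self.label = Counter(label for _, label in rows).most_common(1)[0][0]
        return self

    def predict(self, features):
        return self.label


job = TrainingJob(
    samples=[[((0.0,), "a"), ((1.0,), "a"), ((2.0,), "b")], [((3.0,), "b")]],
    algorithm=MostFrequentLabel,
    parallelism=2,
)
models = run(job)
```

The learner itself contains no parallel code; swapping `MostFrequentLabel` for a wrapper around another library's classifier would leave `run` unchanged, which is the property the paragraph above describes.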

As a proof of concept, the LADEL model is implemented on Apache Drill to distribute the training of Weka's classification algorithms. Our empirical studies show that LADEL classifiers achieve accuracy similar to, and sometimes better than, that of single-node classifiers, with significantly faster training and scoring times.


Updated: 2018-11-16
