Model averaging in distributed machine learning: a case study with Apache Spark
The VLDB Journal (IF 4.2), Pub Date: 2021-04-15, DOI: 10.1007/s00778-021-00664-7
Yunyan Guo, Zhipeng Zhang, Jiawei Jiang, Wentao Wu, Ce Zhang, Bin Cui, Jianzhong Li

The increasing popularity of Apache Spark has attracted many users to put their data into its ecosystem. On the other hand, it has been observed in the literature that Spark is slow when it comes to distributed machine learning (ML). One resort is to switch to specialized systems such as parameter servers, which are claimed to offer better performance. Nonetheless, users then have to undergo the painful procedure of moving data into and out of Spark. In this paper, we investigate the performance bottlenecks of MLlib (Spark's official ML package) in detail, focusing on its implementation of stochastic gradient descent (SGD), the workhorse behind the training of many ML models. We show that the performance inferiority of Spark is caused by implementation issues rather than by fundamental flaws in the bulk synchronous parallel (BSP) model that governs Spark's execution: we can significantly improve Spark's performance by leveraging the well-known "model averaging" (MA) technique from distributed ML. Indeed, model averaging is not limited to SGD, and we further showcase an application of MA to training latent Dirichlet allocation (LDA) models within Spark. Our implementation is not intrusive and requires only light development effort. Experimental evaluation reveals that the MA-based versions of SGD and LDA can be orders of magnitude faster than their counterparts that do not use MA.
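
To make the MA idea concrete, here is a minimal sketch of MA-SGD on Spark, written in Scala against the RDD API: each partition runs a full local SGD pass starting from the current shared model, and the driver combines the resulting local models into a size-weighted average, so communication happens once per pass over the data rather than once per mini-batch as in MLlib's gradient averaging. The logistic-regression objective, the toy data, the learning rate, and the localSGD helper are illustrative assumptions of this sketch, not the authors' implementation.

    import org.apache.spark.broadcast.Broadcast
    import org.apache.spark.sql.SparkSession

    // A minimal sketch of MA-SGD on Spark for logistic regression.
    // Toy data, learning rate, and helper names are illustrative
    // assumptions, not the paper's code.
    object MASGDSketch {

      // One local SGD pass over a partition, starting from the shared
      // model. Returns the updated local model plus the partition size,
      // so the driver can compute a size-weighted average.
      def localSGD(points: Iterator[(Double, Array[Double])],
                   init: Array[Double],
                   lr: Double): Iterator[(Array[Double], Long)] = {
        val w = init.clone()
        var n = 0L
        points.foreach { case (label, x) =>
          var margin = 0.0
          var j = 0
          while (j < w.length) { margin += w(j) * x(j); j += 1 }
          // Gradient of the logistic loss for labels in {0, 1}.
          val g = 1.0 / (1.0 + math.exp(-margin)) - label
          j = 0
          while (j < w.length) { w(j) -= lr * g * x(j); j += 1 }
          n += 1
        }
        Iterator.single((w, n))
      }

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("MA-SGD sketch")
          .master("local[2]").getOrCreate()
        val sc = spark.sparkContext

        // Toy (label, features) pairs, cached so each iteration reuses them.
        val data = sc.parallelize(Seq(
          (1.0, Array(1.0, 0.5)), (0.0, Array(-1.0, -0.3)),
          (1.0, Array(0.8, 0.9)), (0.0, Array(-0.7, -1.1))
        ), numSlices = 2).cache()

        var model = Array.fill(2)(0.0)
        for (_ <- 1 to 20) {
          val bc: Broadcast[Array[Double]] = sc.broadcast(model)
          // Model averaging: one communication round per pass over the
          // data, instead of one round per mini-batch.
          val (sum, total) = data
            .mapPartitions(it => localSGD(it, bc.value, lr = 0.1))
            .map { case (w, n) => (w.map(_ * n), n) }
            .reduce { case ((w1, n1), (w2, n2)) =>
              (Array.tabulate(w1.length)(i => w1(i) + w2(i)), n1 + n2)
            }
          model = sum.map(_ / total)
          bc.destroy()
        }
        println(s"averaged model: ${model.mkString(", ")}")
        spark.stop()
      }
    }

Weighting each local model by its partition size keeps the average unbiased when partitions hold different numbers of records; with equal-sized partitions it reduces to a plain mean.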




Updated: 2021-04-16