Online Placement and Scaling of Geo-distributed Machine Learning Jobs via Volume-discounting Brokerage
IEEE Transactions on Parallel and Distributed Systems ( IF 5.6 ) Pub Date : 2020-04-01 , DOI: 10.1109/tpds.2019.2955935
Xiaotong Li , Ruiting Zhou , Lei Jiao , Chuan Wu , Yuhang Deng , Zongpeng Li

Geo-distributed machine learning (ML) often uses large geo-dispersed data collections produced over time to train global models, without consolidating the data at a central site. In the parameter server architecture, the “workers” and “parameter servers” of a geo-distributed ML job should be strategically deployed and adjusted on the fly, to allow easy access to the datasets and fast exchange of the model parameters at any time. Although many cloud platforms now provide volume discounts to encourage the use of their ML resources, different geo-distributed ML jobs running in the clouds often rent cloud resources separately, and thus rarely benefit from the discounts. We study an ML broker service that aggregates geo-distributed ML jobs into cloud data centers to obtain volume discounts, dynamically placing and scaling the workers and parameter servers of individual jobs online to minimize long-term cost. To decide the number and the placement of workers and parameter servers, we propose an efficient online algorithm that first uses regularization to decompose the online problem into a series of one-shot optimization problems, each solvable at an individual time slot, and then rounds the fractional decisions to integers via a carefully designed dependent rounding method. We prove a parameterized-constant competitive ratio for our online algorithm, and also conduct extensive simulation studies that show its practical performance is close to the offline optimum in realistic settings.
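
The abstract does not spell out the rounding step, so the following is only a minimal sketch of a generic pairwise dependent rounding scheme, not the authors' method. It assumes the fractional decisions have already been reduced to values in [0, 1] (for example, the fractional parts of worker counts); the function name `dependent_round` and its parameters are hypothetical. The idea it illustrates is that mass is shifted between two fractional entries at a time, so the total is preserved and each entry's expected value stays equal to its fractional value.

```python
import random


def dependent_round(x, rng=None, eps=1e-9):
    """Round values in [0, 1] to 0/1 while preserving their sum.

    Illustrative pairwise scheme (an assumption, not taken from the paper):
    each step moves mass between two fractional entries until at least one
    of them becomes integral, with probabilities chosen so that every
    entry's expectation is unchanged.
    """
    rng = rng or random.Random()
    x = list(x)

    def fractional():
        # Indices whose values are still strictly between 0 and 1.
        return [i for i, v in enumerate(x) if eps < v < 1 - eps]

    frac = fractional()
    while len(frac) >= 2:
        i, j = frac[0], frac[1]
        a = min(1 - x[i], x[j])  # largest shift from j to i
        b = min(x[i], 1 - x[j])  # largest shift from i to j
        if rng.random() < b / (a + b):
            x[i] += a
            x[j] -= a
        else:
            x[i] -= b
            x[j] += b
        frac = fractional()

    # A single leftover fractional entry (only when the sum is not an
    # integer) is rounded independently with probability equal to its value.
    if frac:
        i = frac[0]
        x[i] = 1.0 if rng.random() < x[i] else 0.0

    return [int(round(v)) for v in x]


# Usage: four fractional decisions that sum to 2.
rounded = dependent_round([0.5, 0.5, 0.25, 0.75], rng=random.Random(0))
print(rounded, sum(rounded))
```

Because every step keeps the sum of the two touched entries fixed, an input whose total is an integer always rounds to a 0/1 vector with the same total, which is the property that makes dependent rounding attractive when the rounded counts must still satisfy an aggregate resource constraint.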

Updated: 2020-04-01