Job scheduling for distributed machine learning in optical WAN
Future Generation Computer Systems (IF 7.5), Pub Date: 2020-06-09, DOI: 10.1016/j.future.2020.06.007
Ling Liu, Hongfang Yu, Gang Sun, Long Luo, Qixuan Jin, Shouxi Luo

Large companies operate tens of data centers (DCs) across the globe to serve their customers and store data. Many machine learning applications, in turn, need a global view of this geographically dispersed data to achieve high model accuracy. For such geo-distributed machine learning (Geo-DML), however, it is infeasible to move all the data to one location over wide-area networks (WANs) because of scarce WAN bandwidth, privacy concerns, and data sovereignty laws. Most Geo-DML systems therefore train models in a geo-distributed fashion, which requires global model synchronization between DCs over the WAN. As training data and model sizes grow rapidly, it becomes challenging to use scarce and heterogeneous WAN bandwidth efficiently for model synchronization. Advances in optical technology have made the network topology reconfigurable in optical WANs, which opens a new opportunity for Geo-DML training over the WAN.

We propose to optimize Geo-DML training with centralized joint control of the network layer and the reconfigurable optical layer. We prove that the intra-job and inter-job scheduling problems are NP-hard and strongly NP-hard, respectively. For intra-job scheduling, we present RoWAN, which is based on a deterministic rounding algorithm; it dynamically changes the topology by reconfiguring the optical devices and allocates a path and a rate to each flow. For inter-job scheduling, we provide delayed SWRT, which schedules multiple jobs according to their priorities. Simulations on real topologies show that RoWAN reduces the global model synchronization communication time of a single iteration by 15.54% to 48.2% on average compared with traditional solutions. Compared with three other inter-job scheduling approaches, delayed SWRT reduces the weighted job completion time (WJCT) by about 60%, 44.8%, and 28.76%, respectively.
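To make the WJCT metric concrete, the following minimal sketch computes the weighted job completion time for a given job order. It is not the paper's delayed SWRT algorithm, whose details the abstract does not give. It assumes a simplified model in which jobs run one after another on a single shared resource, and it uses a plain weight-to-duration ordering (Smith's rule) as an illustrative stand-in for priority-based scheduling. The job tuples and the function names `weighted_completion_time` and `ratio_order` are hypothetical.

```python
from typing import List, Tuple

# (name, weight, duration): hypothetical fields, not taken from the paper.
Job = Tuple[str, float, float]


def weighted_completion_time(order: List[Job]) -> float:
    """Sum of weight * completion time when jobs run back to back in `order`."""
    clock = 0.0
    total = 0.0
    for _name, weight, duration in order:
        clock += duration          # this job finishes at the current clock
        total += weight * clock
    return total


def ratio_order(jobs: List[Job]) -> List[Job]:
    """Order jobs by weight/duration, highest first (Smith's rule).

    An illustrative stand-in only; it is NOT delayed SWRT.
    """
    return sorted(jobs, key=lambda job: job[1] / job[2], reverse=True)


# Made-up example jobs for illustration.
jobs = [("A", 1.0, 4.0), ("B", 3.0, 2.0), ("C", 2.0, 1.0)]
fifo = weighted_completion_time(jobs)
prioritized = weighted_completion_time(ratio_order(jobs))
print(fifo, prioritized)
```

In this toy instance, running the jobs in arrival order yields a larger weighted completion time than running high-weight, short jobs first, which is the kind of gap an inter-job scheduler tries to close.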




Updated: 2020-06-09