Online job scheduling for distributed machine learning in optical circuit switch networks
Knowledge-Based Systems ( IF 7.2 ) Pub Date : 2020-05-12 , DOI: 10.1016/j.knosys.2020.106002
Ling Liu , Hongfang Yu , Gang Sun , Huaman Zhou , Zonghang Li , Shouxi Luo

Networking has become a well-known performance bottleneck for distributed machine learning (DML). Although many studies have focused on accelerating DML communication, they ignore the impact of the physical network on DML performance. Meanwhile, optical circuit switches (OCSes) are increasingly deployed in data centers and clusters and can fundamentally improve DML performance. Because the OCS reconfiguration delay is non-negligible, the OCS scheduling algorithm has a strong effect on the performance of upper-layer applications. However, existing OCS scheduling solutions are not suitable for DML jobs, which are iterative and interleave communication and computation stages. In this paper, we therefore study online multi-job scheduling for DML in OCS networks. First, we propose heaviest-load-first (HLF), a heuristic algorithm for intra-job scheduling, based on the observation that the completion time of flows on the most heavily loaded port has a significant impact on the job completion time. Second, we present the Shortest Weighted Remaining Time First (SWRTF) algorithm for inter-job scheduling. In SWRTF, an available DML job is scheduled when the job being served moves from its communication stage to its computation stage, which significantly improves circuit utilization. Large-scale simulations show that HLF reduces the iteration communication time by up to 64.97% compared with the state-of-the-art circuit scheduler Sunflow. In addition, SWRTF reduces the Weighted-Job-Completion-Time (WJCT) by up to 42.9%, 54.2%, and 27.2% compared with the Shortest-Job-First, Baraat, and Weighted-First inter-job scheduling algorithms, respectively.
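The two scheduling ideas named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the exact priority formula or load definition, so the sketch assumes that SWRTF picks the job with the smallest remaining-time-to-weight ratio, and that a port's load is the total size of the flows touching it. The names `Job`, `pick_next_job`, and `heaviest_port_first` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    weight: float          # assumed: higher weight means a more important job
    remaining_time: float  # assumed: estimated remaining time, arbitrary units


def pick_next_job(available):
    """Inter-job choice (assumed rule): smallest remaining_time / weight first."""
    return min(available, key=lambda j: j.remaining_time / j.weight)


def heaviest_port_first(flows):
    """Intra-job ordering (assumed rule): flows on the most loaded port go first.

    flows is a list of (src_port, dst_port, size) tuples. Input and output
    ports are tracked separately, and a flow is ranked by the heavier of the
    two ports it uses.
    """
    load = {}
    for src, dst, size in flows:
        load[("in", src)] = load.get(("in", src), 0) + size
        load[("out", dst)] = load.get(("out", dst), 0) + size
    return sorted(
        flows,
        key=lambda f: max(load[("in", f[0])], load[("out", f[1])]),
        reverse=True,
    )


jobs = [Job("a", 1.0, 10.0), Job("b", 4.0, 20.0), Job("c", 2.0, 12.0)]
next_job = pick_next_job(jobs)

flows = [(0, 1, 5), (0, 2, 7), (3, 1, 1)]
ordered = heaviest_port_first(flows)
```

In this toy input, job `b` has the smallest ratio (20 / 4 = 5), and the flow `(3, 1, 1)` is ordered last because neither of its ports is heavily loaded. A real scheduler would also have to account for the OCS reconfiguration delay, which this sketch leaves out.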




Updated: 2020-05-12