DL2: A Deep Learning-Driven Scheduler for Deep Learning Clusters
IEEE Transactions on Parallel and Distributed Systems (IF 5.3), Pub Date: 2021-01-19, DOI: 10.1109/tpds.2021.3052895
Yanghua Peng, Yixin Bao, Yangrui Chen, Chuan Wu, Chen Meng, Wei Lin

Efficient resource scheduling is essential for maximal utilization of expensive deep learning (DL) clusters. Existing cluster schedulers are either agnostic to machine learning (ML) workload characteristics, or rely on scheduling heuristics built on operators' understanding of particular ML frameworks and workloads; the former are less efficient and the latter not general enough. In this article, we show that DL techniques can be adopted to design a generic and efficient scheduler. Specifically, we propose DL2, a DL-driven scheduler for DL clusters, which expedites training jobs globally by dynamically resizing the resources allocated to them. DL2 advocates a joint supervised learning and reinforcement learning approach: a neural network is warmed up via offline supervised learning on job traces produced by the existing cluster scheduler; the neural network is then plugged into the live DL cluster, fine-tuned by reinforcement learning carried out throughout the training progress of the DL jobs, and used to decide job resource allocation in an online fashion. We implement DL2 on Kubernetes and enable dynamic resource scaling for DL jobs on MXNet. Extensive evaluation shows that DL2 outperforms a fairness scheduler (DRF) by 44.1 percent and an expert heuristic scheduler (Optimus) by 17.5 percent in terms of average job completion time.
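The two-phase training the abstract describes can be sketched in miniature. The snippet below is a toy illustration, not the paper's actual model: it uses a single-layer softmax policy over hypothetical worker-count actions, warms it up by cross-entropy on synthetic (state, action) traces standing in for an existing scheduler's decisions, and then applies one REINFORCE fine-tuning step with a scalar reward standing in for the paper's training-progress signal. All dimensions, the "expert" rule, and the reward value are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: a state summarizes a job's resource demand; an action
# picks one of A worker counts to allocate. (Illustrative only -- not
# the state/action encoding used by DL2.)
S, A = 6, 3
W = np.zeros((S, A))  # single-layer softmax policy network

def policy(state, W):
    """Softmax distribution over allocation actions."""
    logits = state @ W
    e = np.exp(logits - logits.max())
    return e / e.sum()

def grad_log_pi(state, action, W):
    """Gradient of log pi(action | state) w.r.t. W for a softmax policy."""
    g = -np.outer(state, policy(state, W))
    g[:, action] += state
    return g

# Phase 1: offline supervised warm-up on (state, action) traces produced
# by an existing scheduler -- gradient ascent on the log-likelihood,
# i.e., cross-entropy minimization.
def supervised_step(W, state, action, lr=0.5):
    return W + lr * grad_log_pi(state, action, W)

# Phase 2: online REINFORCE fine-tuning; the reward is a stand-in for
# the normalized training-progress signal described in the abstract.
def reinforce_step(W, state, action, reward, lr=0.1):
    return W + lr * reward * grad_log_pi(state, action, W)

# Warm up on a synthetic trace where a hypothetical "expert" scheduler
# allocates more workers to jobs with larger demand (encoded in state[0]).
for _ in range(500):
    state = rng.normal(size=S)
    expert = 2 if state[0] > 0.5 else (1 if state[0] > -0.5 else 0)
    W = supervised_step(W, state, expert)

# One online fine-tuning step: a positive reward reinforces the action
# sampled from the current policy.
state = rng.normal(size=S)
action = rng.choice(A, p=policy(state, W))
W = reinforce_step(W, state, action, reward=1.0)
```

After warm-up the policy imitates the trace-generating scheduler, which is the point of the warm-up phase: reinforcement learning then starts from a reasonable policy instead of random allocations, so online exploration on the live cluster is far less disruptive.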

Updated: 2021-02-23