JPAS: Job-progress-aware flow scheduling for deep learning clusters
Journal of Network and Computer Applications ( IF 8.7 ) Pub Date : 2020-03-11 , DOI: 10.1016/j.jnca.2020.102590
Pan Zhou , Xinshu He , Shouxi Luo , Hongfang Yu , Gang Sun

Deep learning (DL) is an increasingly important tool for large-scale data analytics, and DL workloads are common in today's production clusters due to the growing number of deep-learning-driven services (e.g., online search and speech recognition). To handle ever-growing training datasets, it is common to conduct distributed DL (DDL) training that leverages multiple machines in parallel. Training DL models in parallel can incur significant bandwidth contention on shared clusters; as a result, the network is a well-known bottleneck for distributed training, and efficient network scheduling is essential for maximizing training performance. DL training is a feedback-driven exploration process (e.g., hyper-parameter tuning, model structure optimization) that requires retraining deep learning models multiple times under different configurations. Information from the early stage of each retraining can guide the search directly toward high-quality models, so reducing the early-stage time accelerates the exploration. In this paper, we propose JPAS, a flow scheduling system for DDL training jobs that aims to reduce the early-stage time. JPAS uses a simple greedy mechanism to periodically order all DDL jobs. Each host machine sets priorities for its flows according to this job order and offloads flow scheduling and rate allocation to the underlying priority-enabled network. We evaluate JPAS on a real testbed composed of 13 servers and a commodity switch. The evaluation results demonstrate that JPAS reduces the time to reach 90% or 95% of the converged accuracy by up to 38%. Hence, JPAS can markedly reduce the early-stage time and thus accelerate the search for high-quality models.
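The mechanism the abstract describes — periodically ordering jobs with a greedy rule, then mapping that order onto priorities that a priority-enabled network enforces — can be illustrated with a minimal sketch. This is a hypothetical illustration, not the paper's actual algorithm: the `Job` structure, the use of training progress as the ordering key, and the fixed number of priority classes are all assumptions for the example.

```python
# Hypothetical sketch of JPAS-style flow prioritization: order jobs
# greedily, then map the order onto a fixed set of network priority
# classes (e.g., switch queues), with 0 as the highest priority.
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    progress: float  # fraction of training completed, in [0, 1]

def order_jobs(jobs):
    """Greedy ordering; here, earlier-stage jobs come first (an
    assumed proxy for the paper's job-progress-aware ordering)."""
    return sorted(jobs, key=lambda j: j.progress)

def assign_priorities(jobs, num_classes=8):
    """Map each job's rank in the order to a priority class; jobs
    beyond the last class share the lowest priority."""
    ordered = order_jobs(jobs)
    return {j.name: min(rank, num_classes - 1)
            for rank, j in enumerate(ordered)}

jobs = [Job("resnet", 0.9), Job("bert", 0.1), Job("vgg", 0.5)]
print(assign_priorities(jobs))  # {'bert': 0, 'vgg': 1, 'resnet': 2}
```

In a real deployment, each host would tag its flows' packets with the assigned class (e.g., via DSCP marks), so the switches do the actual scheduling and rate allocation without per-flow state at the hosts.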




Updated: 2020-03-11