JPAS: Job-progress-aware flow scheduling for deep learning clusters

https://doi.org/10.1016/j.jnca.2020.102590

Abstract

Deep learning (DL) is an increasingly important tool for large-scale data analytics, and DL workloads are common in today's production clusters due to the growing number of deep-learning-driven services (e.g., online search and speech recognition). To handle ever-growing training datasets, it is common to conduct distributed DL (DDL) training that leverages multiple machines in parallel. Training DL models in parallel can incur significant bandwidth contention on shared clusters; as a result, the network is a well-known bottleneck for distributed training, and efficient network scheduling is essential for maximizing the performance of DL training. Moreover, DL training is feedback-driven exploration (e.g., hyper-parameter tuning, model structure optimization), which requires multiple retrainings of deep learning models that differ in their configuration. Information from the early stage of each retraining can guide the search toward high-quality models, so reducing the early-stage time accelerates this exploration. In this paper, we propose JPAS, a flow scheduling system for DDL training jobs that aims to reduce the early-stage time. JPAS uses a simple greedy mechanism to periodically order all DDL jobs. Each host machine sets priorities for its flows according to this job order and offloads flow scheduling and rate allocation to the underlying priority-enabled network. We evaluate JPAS on a real testbed composed of 13 servers and a commodity switch. The evaluation results demonstrate that JPAS can reduce the time to reach 90% or 95% of the converged accuracy by up to 38%. Hence, JPAS can remarkably reduce the early-stage time and thus accelerate the search for high-quality models.

Introduction

Deep learning (DL) is becoming increasingly popular and is realizing substantial success in various domains, such as computer vision (Wu et al., 2019), image processing (Zeng et al., 2019) (Zhang et al., 2018), and speech recognition (He et al., 2019). The increasing volume of big data and the growing scale of training models (e.g., deep neural networks) significantly improve learning performance but also remarkably increase training time. To deal with ever-growing training datasets and large-scale models, researchers have designed distributed algorithms, along with flexible platforms such as TensorFlow (Agarwal, 2019) and MXNet (Li et al., 2019), for conducting efficient model training in parallel.

Many leading companies have built GPU clusters on which to conduct distributed model training over large datasets for their AI-driven services (Hazelwood et al., 2018) (Jeon et al., 2018). In addition, these clusters are shared by many users to satisfy the rising number of distributed DL (DDL) jobs (Amazon EC2 Elastic GPUs; GPU-accelerated Microsoft Azure; GPUs on Google Cloud). According to the traces of a Microsoft cluster, the number of DL jobs increases by 10.5× each year (Jeon et al., 2018). Under this increasing workload, network competition among DDL jobs significantly impacts the efficiency of model training. The following three key characteristics of DDL jobs pose new challenges to network resource management.

DDL training is communication-intensive. The most common option for DDL training is data parallelism, in which the training dataset is divided into equal-sized parts to feed the workers; each worker occupies a GPU and works on its local copy of the DL model. The workers communicate with each other once per iteration to synchronize the model updates from the other workers, and only after this synchronization do they start the next iteration. The model synchronization generates a large amount of network traffic, which increases with the number of workers and the scale of the model. Production DL models range from a few hundred megabytes to a few gigabytes in size; larger models incur heavier communication overhead and longer per-iteration synchronization time in distributed training.
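To make this communication pattern concrete, the following minimal sketch (our illustration, not taken from any DDL framework; the model, names and sizes are placeholders) mimics synchronous data-parallel training of a least-squares model: each worker computes a gradient on its own shard, and the per-iteration averaging step is where a real cluster exchanges traffic on the order of the model size per worker.

```python
import numpy as np

# Illustrative sketch of synchronous data-parallel SGD: each "worker" holds an
# equal-sized shard and a replica of the model; the gradient averaging once
# per iteration is the step that generates per-iteration network traffic
# proportional to the model size and the number of workers.

rng = np.random.default_rng(0)
num_workers, model_dim = 4, 1000
X = rng.normal(size=(4000, model_dim))
true_w = rng.normal(size=model_dim)
y = X @ true_w + 0.01 * rng.normal(size=4000)

shards = np.array_split(np.arange(len(X)), num_workers)   # equal-sized parts
w = np.zeros(model_dim)                                    # replicated model

for it in range(50):
    # Each worker computes a gradient on its local shard (in parallel in practice).
    grads = [X[idx].T @ (X[idx] @ w - y[idx]) / len(idx) for idx in shards]

    # Model synchronization: here each worker would transmit ~model_dim floats
    # and receive the aggregate, once per iteration.
    w -= 0.1 * np.mean(grads, axis=0)                      # all replicas stay in sync
```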

DL training is typically iterative with diminishing returns. The training is time-consuming and typically requires multiple passes over large datasets (a pass is called an epoch). The training generates a low-accuracy model initially and improves the model's accuracy through a sequence of training iterations until it converges. The accuracy improvement diminishes as additional iterations are completed. In addition, a small number of iterations at the beginning can reach a high accuracy while many remaining iterations are spent on further improving the accuracy until convergence. For example, in Microsoft's cluster, approximately 75% of jobs reach within 0.1% of their best accuracy in 40% of the training epochs (Jeon et al., 2018).

Training of DL models is feedback-driven exploration. Training a DL model is not a one-time effort and often proceeds in an exploratory manner. ML practitioners retrain their models repeatedly as they preprocess datasets (Wang et al., 2018), tune hyper-parameters (Riquelme et al., 2018), and adjust model structures (Carbin et al., 2019). The objective of retraining is to obtain a final model with the highest accuracy. Since each retraining is time-consuming, it is desirable to explore the trade-off between accuracy and training time. Rather than waiting a long time to train each model to convergence, ML practitioners and automatic machine learning systems (AutoMLs) prefer to use information from the early training stage (e.g., the training or validation accuracy curve) to predict the final performance of a model configuration (e.g., the model structure and the learning rate) and to make the next search decisions, which can significantly reduce the time cost of the exploratory process.

However, the exact number of epochs that is required to predict the final performance with high confidence is unknown before training (Jeon et al., 2018). Practitioners therefore often submit DL training jobs with more epochs than necessary to obtain high-confidence predictions of the final performance. They typically choose to manually monitor the training accuracy and stop the training when the accuracy curves suffice for predicting the final performance with high confidence (Gu et al., 2019) (Xiao et al., 2018). However, training can take hours or even days, and it is impractical for practitioners to monitor the training accuracy in real time and stop the training in a timely manner (Zhang et al., 2017). As a result, jobs that have already received high-confidence predictions of their final performance (later-stage jobs) keep running in the cluster and consume the same computational and communication resources while making only marginal improvements in model quality (Jeon et al., 2018). Since the cluster is shared by multiple DL jobs submitted by practitioners or AutoMLs, the later-stage jobs compete for network bandwidth with jobs that have not yet received high-confidence predictions of their final performance (early-stage jobs). Consequently, the training computation of the early-stage jobs is bottlenecked by network competition. Thus, the durations of the early-stage jobs are remarkably increased, which delays the start of the next retraining and thereby remarkably increases the time for searching for high-quality models.

Therefore, an effective flow scheduler is needed to reduce the training time of the early training stage. However, available flow schedulers cannot realize this objective. Flow-level schedulers, such as PIAS (Bai et al., 2017) and s-PERC (Jose et al., 2019), focus on minimizing the average flow completion time. Coflow schedulers, such as Sincronia (Agarwal et al., 2018) and PRO (Guo et al., 2019), focus on minimizing the average coflow completion time. All these flow schedulers are unaware of the progress of DDL jobs: flows from later-stage jobs may be scheduled ahead of flows from early-stage jobs, which increases the training time of the early stages. Preferentially scheduling flows from early-stage jobs is a straightforward but effective strategy for reducing the early-stage time. However, implementing this strategy raises three challenges.

First, it is difficult to distinguish later-stage jobs and early-stage jobs as there are no standard criteria that define the early stage and the later stage. Both manual training and automatic training have their own criteria. The criteria for manual training depend on practitioners’ domain knowledge and experience, which are highly subjective. The criteria for AutoMLs also differ. For example, AutoKeras (Jin et al., 2018) evaluates a job using the accuracy variance of recent epochs, while Google Vizier (Golovin et al., 2017) evaluates a job based on the probability of reaching the target accuracy. The probability is obtained from a probabilistic model (Domhan et al., 2015) that extrapolates the performance from the first part of the accuracy curve.
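To illustrate how such criteria can differ, the sketch below implements a simple variance-based test in the spirit of AutoKeras' rule; the exact rule in AutoKeras differs, and the window size and threshold here are arbitrary placeholders of ours.

```python
import statistics

def in_early_stage(val_accuracies, window=5, variance_threshold=1e-4):
    """Treat a job as early-stage while its recent validation accuracies are
    still fluctuating, i.e., their variance exceeds a small threshold.
    (Illustrative only; window and threshold are placeholders.)"""
    if len(val_accuracies) < window:
        return True                       # too few epochs to judge
    return statistics.variance(val_accuracies[-window:]) > variance_threshold

# A curve that is flattening out is classified as later-stage.
curve = [0.42, 0.61, 0.70, 0.74, 0.758, 0.761, 0.762, 0.762]
print(in_early_stage(curve))              # -> False
```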

Second, it is difficult to minimize the average early-stage time because the duration and the exact number of epochs of the early stage are unpredictable. Although the shortest-job-first (SJF) and shortest-remaining-time-first (SRTF) algorithms are well known to minimize the average job completion time (JCT) (Mao et al., 2019) (Dell'Amico et al., 2019), they require jobs' running times or remaining times and thus cannot be used here.

Third, it is difficult to realize a readily deployable and scalable design. To be deployable, the modifications to the upper DDL frameworks and the underlying network facilities must be minimal. Practitioners use various DDL frameworks, such as TensorFlow (Agarwal, 2019) and MXNet (Li et al., 2019); if the flow scheduling system required many modifications to DDL frameworks, every framework would have to be modified, which is impractical. Likewise, if the flow scheduling system required software or hardware modifications to switches, it would be non-deployable, as customized switches are highly expensive. To be scalable, the overhead caused by the scheduling system must be minimal. The per-flow rate allocation adopted by most prior designs requires reallocating the rate of every flow in the network whenever a flow arrives or departs, which is impractical in large-scale clusters where thousands of flows may arrive each second. Considering both scalability and deployability, a practical design for scheduling flows remains elusive.

To overcome these challenges, we propose a novel job-progress-aware scheduler (JPAS) that predicts the potential accuracy improvement of DL jobs and applies a maximum-accuracy-improvement-first (MAIF) scheduling policy on the available priority-enabled network. JPAS orders jobs according to their potential accuracy improvement in the following scheduling period and assigns a priority to each flow based on its job's order: a job with a larger potential accuracy improvement receives a higher order, and thus its flows receive higher priority. Since the accuracy improvement diminishes as the number of iterations increases, the later-stage jobs have much smaller accuracy improvements than the early-stage jobs (Jeon et al., 2018) (Zhang et al., 2017). Thus, the early-stage jobs are ordered ahead of the later-stage jobs and are scheduled first. In addition, among early-stage jobs, those with higher accuracy improvement are preferentially scheduled over those with lower accuracy improvement, which helps to reduce the average early-stage time.
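As an illustration of this ordering step, the following sketch (our simplification; the field and function names are ours, and the real JPAS scheduler additionally handles periodic re-ordering and flow-to-queue mapping) sorts jobs by their predicted accuracy improvement and maps the resulting ranks onto a fixed number of priority queues.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    predicted_improvement: float   # predicted accuracy gain over the next period

def maif_order(jobs, num_priority_queues=8):
    """Return {job name: priority queue index}, 0 being the highest priority."""
    ranked = sorted(jobs, key=lambda j: j.predicted_improvement, reverse=True)
    # With more jobs than hardware queues, low-ranked jobs share the lowest queue.
    return {job.name: min(rank, num_priority_queues - 1)
            for rank, job in enumerate(ranked)}

jobs = [Job("resnet-tuning", 0.004),    # later-stage: tiny expected gain
        Job("new-lstm-search", 0.120),  # early-stage: large expected gain
        Job("vgg-retrain", 0.031)]
print(maif_order(jobs))  # {'new-lstm-search': 0, 'vgg-retrain': 1, 'resnet-tuning': 2}
```

With more jobs than hardware queues, all low-ranked jobs share the lowest-priority queue, which keeps the mapping feasible on commodity switches with, for example, eight queues per port.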

JPAS predicts the potential accuracy improvement by using the accuracy curves of the early stage, and the data of these curves can be obtained by reading the log files of DDL jobs, which avoids modifying the upper DL frameworks. Moreover, JPAS maps flows to the corresponding priority queues of the underlying network and offloads flow scheduling and rate allocation to the underlying priority-enabled network, which requires no modification to the network. Finally, as DDL jobs are long-running, JPAS can update the job order over long scheduling periods (e.g., tens of minutes), which incurs ultra-low overhead and makes the system scalable.
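On a Linux host, for example, such priority mapping can be expressed with the standard IP_TOS socket option, which sets the DSCP code point that a DiffServ-capable switch maps to one of its hardware priority queues. The sketch below illustrates this host-side tagging step; the function name and the DSCP-to-queue mapping are placeholders of ours, not JPAS' actual configuration.

```python
import socket

# Illustrative host-side DiffServ tagging: the sender marks a flow's packets
# with a DSCP code point via the standard IP_TOS socket option; a
# priority-enabled switch then maps the code point to one of its hardware
# queues. The queue-to-DSCP mapping below is a placeholder.

DSCP_FOR_QUEUE = [46, 40, 32, 24, 16, 8, 4, 0]   # queue 0 = highest priority

def open_tagged_flow(dst_addr, dst_port, priority_queue):
    dscp = DSCP_FOR_QUEUE[priority_queue]
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # DSCP occupies the upper 6 bits of the (former) IP TOS byte.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp << 2)
    sock.connect((dst_addr, dst_port))
    return sock   # all traffic on this connection now carries the DSCP mark
```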

This mechanism may not be optimal, but our implementation and experiments show that JPAS performs very well. We have implemented JPAS on top of TCP, with DiffServ (Chan et al., 2006) for priority scheduling. Our implementation is work-conserving and efficiently handles the online arrival of flows. We evaluate the JPAS implementation on a 13-server testbed that is interconnected by a 1-Gbps commodity Ethernet switch. We use MXNet (Li et al., 2019) to conduct DDL training with a variety of models and datasets. Our experimental results demonstrate that JPAS can reduce the time to reach 90%/95% of the converged accuracy by up to 38%.

The key contributions of the paper are as follows:

  • We propose minimizing the time of the early training stages via flow scheduling, which can significantly accelerate the exploratory process of deep learning.

  • We are the first to consider the training progress and propose the MAIF policy for scheduling the flows of DDL jobs, which overcomes the challenge that the durations and epoch counts of the early training stages are unknown when minimizing the early-stage times.

  • We propose a method for predicting the accuracy improvements of DDL jobs, which is important for the MAIF policy.

  • We implement MAIF and the method of accuracy prediction in a system, namely, JPAS, which is practical and readily deployable. JPAS does not require any modification to the upper ML frameworks or underlying network facilities.

The remainder of the paper is organized as follows: §2 overviews the background and the motivation of our work. §3 provides an overview of our scheduling system, namely, JPAS. Then, §4 describes the key techniques that are adopted by JPAS. §5 evaluates JPAS via testbed experiments. §6 discusses the related works, and §7 concludes the paper.

Section snippets

Background and motivation

This section provides background and motivation for JPAS. §2.1 discusses how DDL jobs are distributed in the GPU clusters. §2.2 discusses the iterative characteristic and the exploratory process of DL training. §2.3 introduces the challenges of network scheduling among DDL jobs and provides the key strategies of our solution, namely, JPAS. §2.4 discusses the potential benefits of JPAS.

Overview of JPAS

For realizing and evaluating MAIF, we have developed a system, namely, JPAS. JPAS is readily deployable and scalable, as it requires no modifications to the upper DDL frameworks or the underlying network. JPAS obtains training information by reading the log files of DDL jobs and conducts priority scheduling by using the underlying priority-enabled network. JPAS orders jobs according to their potential accuracy improvement in the following scheduling period and assigns a priority to each flow based on its job order.

Prediction of the accuracy improvement

To predict the accuracy improvement, JPAS estimates the training velocity and uses it to predict the number of iterations that the job would complete in the next scheduling period if it monopolized all bandwidth, and JPAS uses a curve model to predict the training accuracy of future iterations. The difference between the predicted accuracy at the end of the next scheduling period and the current accuracy is the accuracy improvement. The model of training-accuracy curves is presented in §4.1. We demonstrate the estimation …
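As a concrete illustration of this prediction step, the sketch below assumes a saturating power-law curve model and a velocity measured in iterations per second; the paper's actual curve model and velocity estimator may differ, and all names here are ours.

```python
import numpy as np
from scipy.optimize import curve_fit

# Illustrative sketch: fit a saturating power-law curve to the observed
# accuracy-vs-iteration points, use the training velocity (iterations per
# second when not network-bottlenecked) to see how far the job would get in
# the next scheduling period, and report the predicted accuracy gain.

def acc_curve(k, a, b, c):
    return a - b * np.power(k, -c)          # saturates toward accuracy a

def predict_improvement(iters, accs, velocity_ips, period_s):
    params, _ = curve_fit(acc_curve, iters, accs,
                          p0=[0.9, 1.0, 0.5], maxfev=10000)
    k_end = iters[-1] + velocity_ips * period_s   # iterations reachable next period
    return acc_curve(k_end, *params) - accs[-1]

# Example: a flattening curve, 2 iterations/s, a 600-second scheduling period.
iters = np.array([100, 200, 400, 800, 1600, 3200], dtype=float)
accs = np.array([0.55, 0.66, 0.72, 0.755, 0.772, 0.781])
print(predict_improvement(iters, accs, velocity_ips=2.0, period_s=600.0))
```

The returned value is exactly the quantity by which MAIF orders jobs for the next scheduling period.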

Experimental settings

Testbed. We built a testbed that consists of 10 CPU servers and 3 GPU servers connected to a 48-port Edge-core AS4610-54T 1 GbE switch, as shown in Fig. 14. Each port of the switch has 8 priority queues. Each CPU server has two 8-core Intel E5-2609 CPUs, 32 GB of memory, and two 300 GB HDDs. Each GPU server has one 8-core Intel E5-2650 CPU, two NVIDIA 1080Ti GPUs, 64 GB of memory, one 500 GB SSD, and one 4 TB HDD. We use Kubernetes 1.7 (Kubernetes, 2017) to schedule the …

Related work

Distributed learning frameworks. The most common option for DDL training is data parallelism, in which the training dataset is divided into equal-sized parts to feed the workers. The workers synchronize the model updates from the other workers once per iteration by using MPI-Allreduce (IBM Spectrum MPI, 2017) or parameter servers (Li et al., 2014). Most distributed ML/DL frameworks (e.g., MXNet (Li et al., 2019), TensorFlow (Agarwal, 2019) and Angel (Jiang et al., 2017)) employ the parameter server (PS) architecture …

Conclusion

In this paper, we show that DDL jobs in GPU clusters compete for network bandwidth and that the network has become one of the major bottlenecks for DDL training; thus, an efficient flow scheduling method is urgently needed to improve training efficiency. We also show that accelerating the exploratory process of deep learning requires minimizing the durations of the early training stages. However, the most formidable challenge is that the exact durations and numbers of epochs of the early training stages are unpredictable before training …

Author contribution statement

Pan Zhou: Conceptualization, Methodology, Software, Writing. Xinshu He: Software. Shouxi Luo: Writing. Hongfang Yu: Supervision. Gang Sun: Writing, Supervision.

Acknowledgment

This research was partially supported by the National Key Research and Development Program of China (2019YFB1802800), PCL Future Greater-Bay Area Network Facilities for Large-scale Experiments and Applications (PCL2018KP001), Fundamental Research Funds for the Central Universities (2682019CX61), and China Postdoctoral Science Foundation (2019M663552).


References (54)

  • Yoshua Bengio et al. (2017)
  • Léon Bottou et al. Optimization methods for large-scale machine learning. SIAM Rev. (2018)
  • Michael Carbin. The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. Seventh International Conference on Learning Representations (2019)
  • Kwok Ho Chan et al. Configuration guidelines for DiffServ service classes (2006)
  • Chen Chen et al. Round-robin synchronization: mitigating communication bottlenecks in parameter servers. IEEE Conference on Computer Communications (2019)
  • Matteo Dell'Amico et al. Scheduling with Inexact Job Sizes: the Merits of Shortest Processing Time First (2019)
  • Fahad R. Dogar et al. Decentralized task-aware scheduling for data center networks (2014)
  • Tobias Domhan et al. Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves (2015)
  • Jared Dunnmon et al. Cross-modal data programming enables rapid medical machine learning (2019)
  • Daniel Golovin et al. Google Vizier: a service for black-box optimization (2017)
  • Juncheng Gu et al. Tiresias: a GPU cluster manager for distributed deep learning (2019)
  • Kim Hazelwood et al. Applied machine learning at Facebook: a datacenter infrastructure perspective (2018)
  • Yanzhang He et al. Streaming end-to-end speech recognition for mobile devices (2019)
  • Myeongjae Jeon et al. Multi-tenant GPU...
  • Myeongjae Jeon et al. Analysis of large-scale multi-tenant GPU clusters for DNN training workloads (2019)
  • Jie Jiang et al. Angel: a New Large-Scale Machine Learning System (2017)
  • Haifeng Jin et al. Efficient neural architecture search with network morphism (2018)

Pan Zhou is a PhD candidate in Communication and Information Systems at the University of Electronic Science and Technology of China. His research areas include networks, cloud computing, distributed systems and machine learning.

Xinshu He is a master's student in Communication and Information Systems at the University of Electronic Science and Technology of China. His research areas include deep learning and distributed systems.

Shouxi Luo received his B.S. degree in Communication Engineering and Ph.D. degree in Communication and Information Systems from the University of Electronic Science and Technology of China in 2011 and 2016, respectively. From Oct. 2015 to Sep. 2016, he was an Academic Guest at the Department of Information Technology and Electrical Engineering, ETH Zurich. His research interests include data center networks and software-defined networks.

Hongfang Yu received her B.S. degree in Electrical Engineering in 1996 from Xidian University, and her M.S. and Ph.D. degrees in Communication and Information Engineering in 1999 and 2006 from the University of Electronic Science and Technology of China, respectively. From 2009 to 2010, she was a Visiting Scholar at the Department of Computer Science and Engineering, University at Buffalo (SUNY). Her research interests include network survivability, network security and next-generation Internet.

Gang Sun is an associate professor of Computer Science at the University of Electronic Science and Technology of China (UESTC). His research interests include network virtualization, cloud computing, high performance computing and cyber security.
