Speculative Container Scheduling for Deep Learning Applications in a Kubernetes Cluster
arXiv - CS - Performance. Pub Date: 2020-10-21, DOI: arXiv:2010.11307
Ying Mao, Yuqi Fu, Wenjia Zheng, Long Cheng, Qingzhi Liu, and Dingwen Tao

In the past decade, we have witnessed a dramatically increasing volume of data collected from varied sources. The explosion of data has transformed the world, as more information is available for collection and analysis than ever before. To maximize its utilization, various machine and deep learning models have been developed, e.g., CNN [1] and RNN [2], to study data and extract valuable information from different perspectives. While data-driven applications improve countless products, training models for hyperparameter tuning remains a time-consuming and resource-intensive process. Cloud computing provides infrastructure support for training deep learning applications. Cloud service providers, such as Amazon Web Services [3], create isolated virtual environments (virtual machines and containers) for clients, who share physical resources, e.g., CPU and memory. On the cloud, resource management schemes are implemented to enable better sharing among users and boost system-wide performance. However, general scheduling approaches, such as spread-priority and balanced-resource schedulers, do not work well with deep learning workloads. In this project, we propose SpeCon, a novel container scheduler optimized for short-lived deep learning applications. Built on containerization platforms such as Kubernetes [4] and Docker [5], SpeCon analyzes the common characteristics of training processes. We design a suite of algorithms to monitor training progress and speculatively migrate slow-growing models to release resources for fast-growing ones. Extensive experiments demonstrate that SpeCon improves individual job completion time by up to 41.5%, system-wide completion time by 14.8%, and makespan by 24.7%.
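The core idea of the abstract — monitoring per-job training progress and flagging slow-growing models as migration candidates — can be sketched as follows. This is a minimal illustration, not the paper's actual SpeCon algorithm: the class names, the window size, and the loss-improvement threshold are all hypothetical, and a real deployment would couple the decision to the Kubernetes API (e.g., evicting the pod so it is rescheduled onto a less-contended node).

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TrainingJob:
    """A containerized training job whose recent loss values we observe."""
    name: str
    loss_history: List[float] = field(default_factory=list)

    def growth_rate(self, window: int = 3) -> float:
        """Relative loss improvement over the last `window` epochs.
        Larger values mean the model is still converging quickly."""
        if len(self.loss_history) < window + 1:
            return float("inf")  # too early to judge; never migrate
        old = self.loss_history[-window - 1]
        new = self.loss_history[-1]
        return (old - new) / max(abs(old), 1e-12)

def pick_migration_candidate(jobs: List[TrainingJob],
                             threshold: float = 0.01) -> Optional[TrainingJob]:
    """Return the slowest-growing job whose improvement rate fell below
    `threshold`, i.e., the speculative migration candidate, or None."""
    slow = [j for j in jobs if j.growth_rate() < threshold]
    if not slow:
        return None
    return min(slow, key=lambda j: j.growth_rate())

# A fast-converging job keeps its resources; a near-plateaued one is flagged.
fast = TrainingJob("fast", [1.0, 0.8, 0.6, 0.4, 0.3])
slow = TrainingJob("slow", [0.50, 0.499, 0.4985, 0.498, 0.4979])
candidate = pick_migration_candidate([fast, slow])
```

Here `candidate` is the `slow` job: its loss has barely moved over the observation window, so migrating it frees resources for the fast-growing one.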

Updated: 2020-10-23