Generalized Cost-Based Job Scheduling in Very Large Heterogeneous Cluster Systems, IEEE Transactions on Parallel and Distributed Systems

IEEE Transactions on Parallel and Distributed Systems (IF 5.3), Pub Date: 2020-11-01, DOI: 10.1109/tpds.2020.2997771
Wasiur R. KhudaBukhsh, Sounak Kar, Bastian Alt, Amr Rizk, Heinz Koeppl

We study job assignment in large, heterogeneous resource-sharing clusters of servers with finite buffers. This load balancing problem arises naturally in today's communication and big data systems, such as Amazon Web Services, Network Service Function Chains, and Stream Processing. Arriving jobs are dispatched to a server, following a load balancing policy that optimizes a performance criterion such as job completion time. Our contribution is a randomized Cost-Based Scheduling (CBS) policy in which the job assignment is driven by general cost functions of the server queue lengths. Beyond existing schemes, such as the Join the Shortest Queue (JSQ), the power of $d$ or the SQ($d$) and the capacity-weighted JSQ, the notion of CBS yields new application-specific policies such as hybrid locally uniform JSQ. As today's data center clusters have thousands of servers, exact analysis of CBS policies is tedious. In this article, we derive a scaling limit when the number of servers grows large, facilitating a comparison of various CBS policies with respect to their transient as well as steady state behavior. A byproduct of our derivations is the relationship between the queue filling proportions and the server buffer sizes, which cannot be obtained from infinite buffer models. Finally, we provide extensive numerical evaluations and discuss several applications including multi-stage systems.
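The abstract does not give the exact form of the CBS policy, so the following is only a minimal Python sketch of the general idea it describes: a randomized dispatcher that samples $d$ servers and sends the job to the sampled server with the lowest cost, where the cost is a function of the queue length. The function name `dispatch`, the cost-function signature, the uniform tie-breaking, and the rule that a job is blocked when every sampled server has a full buffer are all assumptions made for illustration, not details taken from the paper.

```python
import random
from typing import Callable, Optional, Sequence


def dispatch(
    queues: Sequence[int],
    buffers: Sequence[int],
    cost: Callable[[int, int], float],
    d: int,
    rng: random.Random,
) -> Optional[int]:
    """Pick a server for one arriving job (illustrative sketch only).

    queues[i]  : current queue length of server i
    buffers[i] : finite buffer size of server i
    cost(i, q) : hypothetical cost of server i at queue length q
    d          : number of servers sampled uniformly without replacement

    Returns the chosen server index, or None if every sampled server
    is full (assumed blocking rule, not taken from the paper).
    """
    candidates = rng.sample(range(len(queues)), d)
    # Only servers with free buffer space can accept the job.
    available = [i for i in candidates if queues[i] < buffers[i]]
    if not available:
        return None
    best = min(cost(i, queues[i]) for i in available)
    # Break ties uniformly at random among the cheapest servers.
    ties = [i for i in available if cost(i, queues[i]) == best]
    return rng.choice(ties)


def queue_length_cost(i: int, q: int) -> float:
    """Cost equal to the queue length: JSQ if d = n, SQ(d) if d < n."""
    return float(q)


def make_capacity_weighted_cost(rates: Sequence[float]) -> Callable[[int, int], float]:
    """Cost q / rate_i, one possible reading of capacity-weighted JSQ."""
    def cost(i: int, q: int) -> float:
        return q / rates[i]
    return cost
```

Under these assumptions, sampling all servers with `queue_length_cost` behaves like JSQ, sampling $d$ of them behaves like SQ($d$), and `make_capacity_weighted_cost` favors faster servers. For instance, with queue lengths `[3, 0, 2]`, buffers `[5, 5, 5]` and `d = 3`, the queue-length cost selects server 1.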

Updated: 2020-11-01
