Performance-aware Speculative Resource Oversubscription for Large-scale Clusters, IEEE Transactions on Parallel and Distributed Systems

Performance-aware Speculative Resource Oversubscription for Large-scale Clusters
IEEE Transactions on Parallel and Distributed Systems (IF 5.3), Pub Date: 2020-07-01, DOI: 10.1109/tpds.2020.2970013
Renyu Yang, Chunming Hu, Xiaoyang Sun, Peter Garraghan, Tianyu Wo, Zhenyu Wen, Hao Peng, Jie Xu, Chao Li

Achieving a high degree of resource utilization is a long-standing challenge in cluster scheduling. Resource oversubscription has become a common practice for improving resource utilization and reducing cost. However, current centralized approaches to oversubscription suffer from resource mismatch and fail to take into account other performance requirements, e.g., tail latency. In this article we present ROSE, a new resource management platform capable of conducting performance-aware resource oversubscription. ROSE allows latency-sensitive long-running applications (LRAs) to co-exist with computation-intensive batch jobs. Instead of waiting for resource allocation to be confirmed by the centralized scheduler, job managers in ROSE can independently request to launch speculative tasks on specific machines according to their suitability for oversubscription. Node agents on those machines can, however, avoid excessive resource oversubscription through an admission control mechanism that uses multi-resource threshold control and performance-aware resource throttling. Experiments show that when batch jobs are co-located with latency-sensitive LRAs, CPU utilization and disk utilization can reach 56.34 and 43.49 percent, respectively, while the 95th-percentile read latency in YCSB workloads increases by only 5.4 percent compared with running the LRAs alone.
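The admission control described in the abstract can be illustrated with a minimal sketch. This is not ROSE's implementation: the `Usage` type, the `admit` function, the threshold values, and the latency slack factor are all hypothetical names and numbers chosen for illustration. The sketch assumes a node agent that rejects a speculative task if any resource would exceed its threshold, or if the co-located LRA's observed 95th-percentile latency has already drifted too far above its standalone baseline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    """Resource usage as fractions of node capacity (0.0 to 1.0)."""
    cpu: float
    disk: float
    mem: float


# Hypothetical per-resource thresholds; the paper's actual values are not given here.
THRESHOLDS = Usage(cpu=0.80, disk=0.70, mem=0.85)
# Hypothetical slack: tolerate p95 latency up to 10% above the standalone baseline.
LATENCY_SLACK = 1.10


def admit(current: Usage, demand: Usage, p95_ms: float, baseline_p95_ms: float,
          thresholds: Usage = THRESHOLDS, slack: float = LATENCY_SLACK) -> bool:
    """Decide whether a speculative task may start on this node."""
    # Performance-aware check: refuse new work while the LRA is already degraded.
    if p95_ms > baseline_p95_ms * slack:
        return False
    # Multi-resource threshold check: every projected usage must stay within its limit.
    projected = (current.cpu + demand.cpu,
                 current.disk + demand.disk,
                 current.mem + demand.mem)
    limits = (thresholds.cpu, thresholds.disk, thresholds.mem)
    return all(p <= limit for p, limit in zip(projected, limits))


if __name__ == "__main__":
    node = Usage(cpu=0.50, disk=0.40, mem=0.50)
    small_task = Usage(cpu=0.10, disk=0.10, mem=0.10)
    print(admit(node, small_task, p95_ms=10.0, baseline_p95_ms=10.0))
```

A real node agent would also need to throttle or evict tasks that were already admitted once latency degrades; the sketch covers only the admission decision.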

Updated: 2020-07-01
