Improving the cost efficiency of large-scale cloud systems running hybrid workloads - A case study of Alibaba cluster traces, Sustainable Computing: Informatics and Systems



Sustainable Computing: Informatics and Systems (IF 3.8), Pub Date: 2021-03-03, DOI: 10.1016/j.suscom.2021.100528
Brad Everman, Narmadha Rajendran, Xiaomin Li, Ziliang Zong


The coronavirus pandemic has dramatically disrupted the retail industry: many stores have been forced to close, and people around the world are sheltering in place, making online shopping the inevitable choice. To meet the rapidly increasing demand for e-commerce, more data centers are expected to provide new cloud services, or significantly improve existing ones, to better support hybrid workloads (e.g., online purchase jobs and the batch jobs that support ranking or recommendation systems). Successful cloud systems need to handle huge volumes of traffic from such hybrid workloads efficiently and respond to it quickly. At the same time, reducing the total cost of ownership (TCO) is critical for profitability. Improving system utilization is one effective technique for achieving the twin goals of high performance and low TCO. This paper presents a comprehensive analysis of the 2017 and 2018 cluster traces released by Alibaba, which provide a case study of Alibaba's best practices for improving the performance and cost efficiency of its large-scale cloud systems by consolidating time-sensitive online service jobs with time-insensitive batch jobs. Our investigation indicates that over-subscription (causing resource waste and low utilization) and under-subscription (causing performance degradation) problems co-exist in the current Alibaba system. We develop a simulator that allows us to evaluate possible solutions to these problems and their impact on performance, energy consumption, and TCO. Our experiments show that the estimated TCO can be reduced by $600,000 for the 2018 trace, which runs on over 4,000 machines, without compromising performance. The TCO could decrease by nearly $68 million if a similar strategy were extrapolated to Alibaba's 432,000 web-facing servers.
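As a rough sanity check on the extrapolation, the sketch below scales the reported savings linearly by machine count. Linear scaling and the rounded inputs (exactly $600,000 and exactly 4,000 machines) are assumptions made here for illustration; the abstract does not state how the figure of nearly $68 million was derived, and the function name `extrapolate_savings` is hypothetical.

```python
def extrapolate_savings(trace_savings_usd, trace_machines, target_servers):
    """Scale TCO savings linearly by machine count.

    Assumption: savings per machine stay constant at larger scale.
    This is a back-of-envelope model, not the paper's stated method.
    """
    per_machine = trace_savings_usd / trace_machines
    return per_machine * target_servers


# Rounded figures taken from the abstract (assumed exact for this sketch).
estimate = extrapolate_savings(600_000, 4_000, 432_000)
print(f"${estimate:,.0f}")  # prints "$64,800,000"
```

With these rounded inputs the linear estimate is about $64.8 million, the same order of magnitude as the reported figure. The gap may reflect rounding in the abstract's numbers or a different extrapolation method.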


Updated: 2021-03-16
