Straggler Mitigation at Scale, IEEE/ACM Transactions on Networking



Straggler Mitigation at Scale
IEEE/ACM Transactions on Networking (IF 3.0), Pub Date: 2019-10-28, DOI: 10.1109/tnet.2019.2946464
Mehmet Fatih Aktas, Emina Soljanin

Runtime performance variability has been a major issue, hindering predictable and scalable performance in modern distributed systems. Executing requests or jobs redundantly over multiple servers has been shown to be effective for mitigating variability, both in theory and in practice. Systems that employ redundancy have drawn significant attention, and numerous papers have analyzed the pain and gain of redundancy under various service models and assumptions on the runtime variability. This paper presents a cost (pain) vs. latency (gain) analysis of executing jobs of many tasks by employing replicated or erasure-coded redundancy. The tail heaviness of service time variability is decisive for the pain and gain of redundancy, and we quantify its effect by deriving expressions for cost and latency. Specifically, we try to answer four questions: 1) How do replicated and coded redundancy compare in the cost vs. latency tradeoff? 2) Can we introduce redundancy after waiting some time and expect it to reduce the cost? 3) Can relaunching the tasks that appear to be straggling after some time help to reduce cost and/or latency? 4) Is it effective to use redundancy and relaunching together? We validate the answers we found for each of these questions via simulations that use empirical distributions extracted from Google cluster data.
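The abstract does not spell out the service model, so the following is only a rough Monte Carlo sketch of the first question (replicated vs. coded redundancy), not the authors' analysis. It assumes i.i.d. Pareto task times, that leftover copies are cancelled the moment they are no longer needed, and that cost is the total server time consumed. The names `pareto`, `replicated_job`, `coded_job`, and `average`, and all parameter values, are hypothetical choices made for illustration.

```python
import random


def pareto(rng, xmin, alpha):
    """Sample a Pareto time with P(X > x) = (xmin / x) ** alpha (assumed model)."""
    # rng.random() lies in [0, 1), so 1 - rng.random() lies in (0, 1].
    return xmin / ((1.0 - rng.random()) ** (1.0 / alpha))


def replicated_job(rng, k, c, xmin, alpha):
    """k tasks, each run as c + 1 copies; a task ends when its fastest copy ends."""
    latency, cost = 0.0, 0.0
    for _ in range(k):
        fastest = min(pareto(rng, xmin, alpha) for _ in range(c + 1))
        latency = max(latency, fastest)   # the job waits for its slowest task
        cost += (c + 1) * fastest         # slower copies are cancelled at that time
    return latency, cost


def coded_job(rng, k, n, xmin, alpha):
    """n coded tasks are launched; the job ends when any k of them have finished."""
    times = sorted(pareto(rng, xmin, alpha) for _ in range(n))
    latency = times[k - 1]
    # The k finished tasks cost their own run time; the n - k stragglers
    # are cancelled when the job completes.
    cost = sum(times[:k]) + (n - k) * latency
    return latency, cost


def average(job, runs, seed, **params):
    """Average latency and cost of `job` over `runs` independent simulated jobs."""
    rng = random.Random(seed)
    total_latency = total_cost = 0.0
    for _ in range(runs):
        latency, cost = job(rng, **params)
        total_latency += latency
        total_cost += cost
    return total_latency / runs, total_cost / runs


if __name__ == "__main__":
    common = dict(runs=2000, seed=1, k=10, xmin=1.0, alpha=3.0)
    print("no redundancy:", average(replicated_job, c=0, **common))
    print("replication  :", average(replicated_job, c=1, **common))
    print("coding       :", average(coded_job, n=20, **common))
```

Both redundant configurations above use 20 servers for a 10-task job, so the printed pairs give a like-for-like view of mean latency against mean cost under these assumptions. Lowering `alpha` makes the tail heavier, which is the quantity the abstract identifies as decisive.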


Updated: 2020-01-04
