Stability and Optimization of Speculative Queueing Networks

arXiv - CS - Performance. Pub Date: 2021-04-21. DOI: arxiv-2104.10426
Jonatan Anselmi, Neil Walton

We provide a queueing-theoretic framework for job replication schemes based on the principle "replicate a job as soon as the system detects it as a straggler". This is called job speculation. Recent works have analyzed replication on arrival, which we refer to as replication. Replication is motivated by its implementation in Google's BigTable. However, systems such as Apache Spark and Hadoop MapReduce implement speculative job execution. The performance and optimization of speculative job execution are not well understood. To this end, we propose a queueing network model for load balancing where each server can speculate on the execution time of a job. Specifically, each job is initially assigned to a single server by a frontend dispatcher. Then, when its execution begins, the server sets a timeout. If the job completes before the timeout, it leaves the network; otherwise, the job is terminated and relaunched or resumed at another server, where it will complete. We provide a necessary and sufficient condition for the stability of speculative queueing networks with heterogeneous servers, general job sizes and scheduling disciplines. We find that speculation can increase the stability region of the network when compared with standard load balancing models and replication schemes. We provide general conditions under which timeouts increase the size of the stability region and derive a formula for the optimal speculation time, i.e., the timeout that minimizes the load induced through speculation. We compare speculation with redundant-d and redundant-to-idle-queue-d rules under an S&X model. For lightly loaded systems, redundancy schemes provide better response times. However, for moderate to heavy loads, redundancy schemes can lose capacity and have markedly worse response times when compared with a speculative scheme.
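To make the timeout mechanism concrete, the following is a minimal sketch of how a timeout changes the expected server time consumed per job. It is an illustration only and not the paper's model or formula: the two-point execution-time distribution, the kill-and-relaunch-from-scratch rule, the independent redraw at the second server, and the names `EXEC_TIMES`, `expected_work` and `best_timeout` are all assumptions introduced here.

```python
from typing import List, Sequence, Tuple

# Hypothetical execution-time distribution as (time, probability) pairs:
# most runs are fast, a few are stragglers. These numbers are assumptions.
EXEC_TIMES: List[Tuple[float, float]] = [(1.0, 0.9), (10.0, 0.1)]


def mean_time(dist: Sequence[Tuple[float, float]]) -> float:
    """Mean execution time of a job that runs to completion."""
    return sum(t * p for t, p in dist)


def expected_work(timeout: float, dist: Sequence[Tuple[float, float]]) -> float:
    """Expected total server time per job under a kill-and-relaunch timeout.

    Assumed rule: the first server runs the job for min(T, timeout). If the
    job has not finished by then, it is killed and relaunched from scratch
    at a second server, where an independent execution time is drawn and
    the job runs to completion with no further timeout.
    """
    first_attempt = sum(min(t, timeout) * p for t, p in dist)
    p_timeout = sum(p for t, p in dist if t > timeout)
    return first_attempt + p_timeout * mean_time(dist)


def best_timeout(candidates: Sequence[float],
                 dist: Sequence[Tuple[float, float]]) -> float:
    """Candidate timeout with the smallest expected work (grid search)."""
    return min(candidates, key=lambda tau: expected_work(tau, dist))


if __name__ == "__main__":
    grid = [0.5, 1.0, 2.0, 5.0, 10.0, float("inf")]
    for tau in grid:
        print(f"timeout={tau}: expected work={expected_work(tau, EXEC_TIMES):.2f}")
    print("best timeout on grid:", best_timeout(grid, EXEC_TIMES))
```

Under these assumed numbers, a timeout of 1.0 gives an expected work of about 1.19 per job, compared with 1.9 when no timeout is used (`float("inf")`), while a timeout that is too short (0.5) kills every job and raises the work to 2.4. This mirrors, in a toy setting, the abstract's point that a well-chosen timeout can reduce the load induced by stragglers.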

Updated: 2021-04-22
