Straggler-Aware Distributed Learning: Communication–Computation Latency Trade-Off


Entropy (IF 2.1), Pub Date: 2020-05-13, DOI: 10.3390/e22050544
Emre Ozfatura, Sennur Ulukus, Deniz Gündüz

When gradient descent (GD) is scaled to many parallel workers for large-scale machine learning applications, its per-iteration computation time is limited by straggling workers. Straggling workers can be tolerated by assigning redundant computations and/or coding across data and computations, but in most existing schemes each non-straggling worker transmits only one message per iteration to the parameter server (PS), after completing all its computations. This limitation has two drawbacks: over-computation due to inaccurate prediction of the straggling behavior, and under-utilization due to discarding the partial computations carried out by stragglers. To overcome these drawbacks, we consider multi-message communication (MMC), which allows multiple computations to be conveyed from each worker per iteration, and propose novel straggler avoidance techniques for both coded computation and coded communication with MMC. We analyze how the proposed designs can be employed efficiently to seek a balance between computation and communication latency. Furthermore, we identify the advantages and disadvantages of these designs in different settings through extensive simulations, both model-based and based on a real implementation on Amazon EC2 servers, and demonstrate that the proposed schemes with MMC can help improve upon existing straggler avoidance schemes.
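The contrast between one message per iteration and MMC can be illustrated with a toy simulation. The sketch below is not the authors' scheme: it assumes a simple uncoded setting in which each of `num_workers` workers is assigned `load` data partitions cyclically, each computation takes a shifted-exponential time (an assumed delay model), and communication latency is represented only by the number of messages the PS has received. All names (`completion`, `simulate`, `load`) are hypothetical.

```python
import random


def completion(events, num_partitions):
    """Return (time, messages received) when the PS first holds every partition."""
    covered = set()
    ordered = sorted(events, key=lambda e: e[0])
    for count, (t, parts) in enumerate(ordered, start=1):
        covered.update(parts)
        if len(covered) == num_partitions:
            return t, count
    raise ValueError("assignment does not cover all partitions")


def simulate(num_workers=8, load=3, trials=2000, seed=0):
    """Average completion time and message count for single-message vs. MMC."""
    rng = random.Random(seed)
    sums = {"single": [0.0, 0], "multi": [0.0, 0]}
    for _ in range(trials):
        single, multi = [], []
        for w in range(num_workers):
            t, parts = 0.0, []
            for j in range(load):
                # Assumed delay model: constant shift plus exponential tail.
                t += 0.1 + rng.expovariate(1.0)
                p = (w + j) % num_workers  # cyclic partition assignment
                parts.append(p)
                multi.append((t, [p]))     # MMC: send each result as it finishes
            single.append((t, parts))      # single message after all tasks finish
        for name, events in (("single", single), ("multi", multi)):
            t_done, n_msgs = completion(events, num_workers)
            sums[name][0] += t_done
            sums[name][1] += n_msgs
    return {name: (s[0] / trials, s[1] / trials) for name, s in sums.items()}


if __name__ == "__main__":
    for name, (avg_time, avg_msgs) in simulate().items():
        print(f"{name:>6}: avg completion time {avg_time:.3f}, avg messages {avg_msgs:.2f}")
```

In this toy model the multi-message variant can never finish later than the single-message one, because partial results from slow workers are used instead of discarded, but the PS must receive at least as many messages. That is the computation versus communication tension the abstract describes, in its simplest form.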


Updated: 2020-05-13
