Petrel: Heterogeneity-aware Distributed Deep Learning via Hybrid Synchronization
IEEE Transactions on Parallel and Distributed Systems (IF 5.3). Pub Date: 2021-05-01. DOI: 10.1109/tpds.2020.3040601
Qihua Zhou, Song Guo, Zhihao Qu, Peng Li, Li Li, Minyi Guo, Kun Wang

The parameter server (PS) paradigm has achieved great success in deploying large-scale distributed Deep Learning (DL) systems. However, these systems implicitly assume that the cluster is homogeneous, an assumption that does not hold in many real-world cases. Although previous efforts have addressed heterogeneity, they mainly prioritize the contribution of fast workers and reduce the involvement of slow workers, which leads to workload imbalance and computational inefficiency. We reveal that grouping workers into communities, an abstraction we propose, and handling parameter synchronization at the community level can overcome these limitations and accelerate training convergence. The community abstraction is inspired by our exploration of prior knowledge about the similarity between workers, which previous work often neglects. These observations motivate a new synchronization mechanism named Community-aware Synchronous Parallel (CASP), which uses an Asynchronous Advantage Actor-Critic (A3C)-based algorithm to intelligently determine the community configuration and improve synchronization performance. The whole idea is implemented in a prototype system called Petrel that strikes a good balance between convergence efficiency and communication overhead. Evaluation on various benchmarks, with multiple metrics and baseline comparisons, demonstrates the effectiveness of Petrel: it accelerates training convergence by up to 1.87x and reduces communication traffic by up to 26.85 percent on average, compared with non-community synchronization mechanisms.
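The abstract does not spell out CASP's mechanics, but the core idea (synchronous, BSP-style gradient aggregation inside each community of similarly fast workers, and asynchronous, ASP-style updates across communities) can be illustrated with a toy simulation. The Python sketch below is a minimal illustration, not the paper's implementation: the greedy throughput-similarity grouping stands in for Petrel's A3C-based community configuration, and the worker names, throughput numbers, and helper functions (`group_into_communities`, `train`) are all hypothetical.

```python
import numpy as np

# Toy model: a single parameter vector trained with SGD toward a target.
DIM, LR = 4, 0.1
rng = np.random.default_rng(0)

def grad(params):
    """Noisy gradient of the quadratic loss ||params - 1||^2,
    with per-worker noise standing in for different minibatches."""
    return 2.0 * (params - np.ones(DIM)) + rng.normal(scale=0.05, size=DIM)

# Hypothetical per-worker throughputs (steps/sec) on a heterogeneous cluster.
throughputs = {"w0": 9.8, "w1": 10.1, "w2": 4.9, "w3": 5.2, "w4": 1.0}

def group_into_communities(speeds, tol=0.25):
    """Greedy stand-in for Petrel's A3C-based configuration: a worker joins
    the first community whose mean speed is within `tol` (relative);
    otherwise it starts a new community."""
    communities = []
    for name, s in sorted(speeds.items(), key=lambda kv: -kv[1]):
        for comm in communities:
            mean = np.mean([speeds[w] for w in comm])
            if abs(s - mean) / mean <= tol:
                comm.append(name)
                break
        else:
            communities.append([name])
    return communities

def train(communities, steps=100):
    """Hybrid synchronization, serialized for simulation: within a community,
    gradients are averaged before one update (BSP-style); across communities,
    each aggregated update hits the global parameters independently
    (ASP-style), so a slow community never blocks a fast one."""
    params = np.zeros(DIM)  # global parameters held by the parameter server
    for _ in range(steps):
        for comm in communities:
            g = np.mean([grad(params) for _ in comm], axis=0)
            params -= LR * g
    return params

comms = group_into_communities(throughputs)
print("communities:", comms)   # [['w1', 'w0'], ['w3', 'w2'], ['w4']]
print("trained params:", train(comms).round(3))
```

On this toy cluster the grouping yields three communities, so the lone straggler (`w4`) only delays its own community rather than the whole cluster, which is the intuition behind synchronizing at the community level.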
