Canary: A Decentralized Deep Learning via Gradient Sketch and Partition in Multi-interface Networks
IEEE Transactions on Parallel and Distributed Systems (IF 5.6) Pub Date: 2021-04-01, DOI: 10.1109/tpds.2020.3036738
Qihua Zhou , Kun Wang , Haodong Lu , Wenyao Xu , Yanfei Sun , Song Guo

Multi-interface networks are efficient infrastructures for deploying distributed Deep Learning (DL) tasks, since the model gradients generated by each worker can be exchanged with the others over different links in parallel. Although this decentralized parameter-synchronization mechanism reduces gradient-exchange time, building a high-performance distributed DL architecture still requires balancing communication efficiency against computational utilization, i.e., addressing the issues of traffic bursts, data consistency, and programming convenience. To achieve this goal, we asynchronously exchange gradient pieces without central control in multi-interface networks. We propose Piece-level Gradient Exchange and Multi-interface Collective Communication to handle parameter synchronization and traffic transmission, respectively. Specifically, we design a gradient sketch approach based on 8-bit uniform quantization to compress gradient tensors, and we introduce the colayer abstraction to better handle gradient partition, exchange, and pipelining. We also provide general programming interfaces to capture the synchronization semantics and build Gradient Exchange Index (GEI) data structures that make our approach applicable online. We implement our algorithms in a prototype system called Canary using PyTorch-1.4.0. Experiments conducted on Alibaba Cloud demonstrate that Canary reduces traffic by 56.28 percent on average and completes training up to 1.61x, 2.28x, and 2.84x faster than BML, Ako on PyTorch, and PS on TensorFlow, respectively.
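The abstract states that the gradient sketch compresses gradient tensors via 8-bit uniform quantization. The following is a minimal NumPy sketch of that general technique, not the authors' Canary implementation: it maps a float32 gradient tensor onto 256 uniform levels between its minimum and maximum, yielding 4x compression, and records the (scale, offset) pair a receiver needs to dequantize.

```python
import numpy as np

def quantize_8bit(grad: np.ndarray):
    """Uniformly quantize a float gradient tensor to uint8 codes.

    Maps the range [g_min, g_max] onto 256 levels and returns the codes
    together with the (scale, g_min) metadata needed to dequantize.
    """
    g_min, g_max = float(grad.min()), float(grad.max())
    scale = (g_max - g_min) / 255.0 or 1.0  # guard against constant tensors
    codes = np.round((grad - g_min) / scale).astype(np.uint8)
    return codes, scale, g_min

def dequantize_8bit(codes: np.ndarray, scale: float, g_min: float) -> np.ndarray:
    """Recover an approximate float32 tensor from the uint8 codes."""
    return codes.astype(np.float32) * scale + g_min

# Example: quantize a random gradient piece before exchanging it.
grad = np.random.randn(1024).astype(np.float32)
codes, scale, g_min = quantize_8bit(grad)
approx = dequantize_8bit(codes, scale, g_min)
# Rounding error is bounded by half a quantization step.
max_err = float(np.abs(grad - approx).max())
```

Per element this replaces a 4-byte float with a 1-byte code, and the worst-case error is `scale / 2`, i.e., 1/510 of the tensor's value range; the per-piece metadata overhead is just two floats.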
