Communication-efficient Decentralized Machine Learning over Heterogeneous Networks
arXiv - CS - Distributed, Parallel, and Cluster Computing. Pub Date: 2020-09-12, DOI: arxiv-2009.05766
Pan Zhou, Qian Lin, Dumitrel Loghin, Beng Chin Ooi, Yuncheng Wu, Hongfang Yu

In the last few years, distributed machine learning has typically been executed over heterogeneous networks, such as a local area network within a multi-tenant cluster or a wide area network connecting data centers and edge clusters. In these heterogeneous networks, the link speeds among worker nodes vary significantly, making it challenging for state-of-the-art machine learning approaches to perform efficient training. Both centralized and decentralized training approaches suffer from low-speed links. In this paper, we propose a decentralized approach, namely NetMax, that enables worker nodes to communicate via high-speed links and, thus, significantly speeds up the training process. NetMax possesses the following novel features. First, it incorporates a novel consensus algorithm that allows worker nodes to train model copies on their local datasets asynchronously and to exchange information via peer-to-peer communication to synchronize their local copies, rather than relying on a central master node (i.e., a parameter server). Second, each worker node selects one peer randomly with a fine-tuned probability to exchange information per iteration. In particular, peers with high-speed links are selected with high probability. Third, the probabilities of selecting peers are designed to minimize the total convergence time. Moreover, we mathematically prove the convergence of NetMax. We evaluate NetMax on heterogeneous cluster networks and show that it achieves speedups of 3.7X, 3.4X, and 1.9X over the state-of-the-art decentralized training approaches Prague, Allreduce-SGD, and AD-PSGD, respectively.
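The mechanism described above combines asynchronous local SGD with per-iteration randomized peer selection biased toward fast links, followed by pairwise model averaging. The following is a minimal sketch of that idea only, not the authors' implementation: the names (Worker, grad_fn, link_speeds) are hypothetical, and raw link speeds are simply normalized into selection probabilities in place of the paper's convergence-time-optimized probabilities.

import numpy as np

class Worker:
    """One decentralized worker holding a local copy of the model parameters."""

    def __init__(self, model, link_speeds, lr=0.01):
        self.model = np.asarray(model, dtype=float)  # local parameter copy
        self.lr = lr
        # Peer-selection probabilities biased toward high-speed links.
        # NetMax tunes these to minimize total convergence time; here we
        # simply normalize the raw link speeds as a stand-in.
        speeds = np.asarray(link_speeds, dtype=float)
        self.peer_probs = speeds / speeds.sum()

    def step(self, grad_fn, peers):
        # 1. Asynchronous local SGD update on the worker's own dataset
        #    (grad_fn returns the gradient of the local loss at self.model).
        self.model = self.model - self.lr * grad_fn(self.model)
        # 2. Pick one peer for this iteration, favoring high-speed links.
        j = np.random.choice(len(peers), p=self.peer_probs)
        # 3. Peer-to-peer averaging to synchronize the two local copies,
        #    replacing the central parameter server of centralized training.
        averaged = 0.5 * (self.model + peers[j].model)
        self.model = averaged
        peers[j].model = averaged

In a full system, each worker would run step() concurrently and transfer only the chosen peer's parameters over the network per iteration, which is where biasing the selection toward high-speed links reduces communication time.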

Updated: 2020-10-21