OD-SGD
ACM Transactions on Architecture and Code Optimization (IF 1.6), Pub Date: 2020-09-30, DOI: 10.1145/3417607
Yemao Xu, Dezun Dong, Yawei Zhao, Weixia Xu, Xiangke Liao

Training modern deep neural networks calls for large amounts of computation, which is typically provided by GPUs or other dedicated accelerators. To scale out and achieve faster training, two update algorithms are mainly applied in distributed training: the Synchronous SGD algorithm (SSGD) and the Asynchronous SGD algorithm (ASGD). SSGD reaches a good convergence point, but its training speed is slowed by the synchronization barrier. ASGD trains faster, but its convergence point is worse than that of SSGD. To exploit the advantages of both, we propose a novel algorithm named One-step Delay SGD (OD-SGD), which combines their strengths in the training process, achieving a convergence point similar to SSGD at a training speed similar to ASGD. To the best of our knowledge, this is the first attempt to combine the features of SSGD and ASGD to improve distributed training performance. Each iteration of OD-SGD consists of a global update on the parameter server node and a local update on each worker node; the local update is introduced to refresh and compensate the one-step-delayed local weights. We evaluate the proposed algorithm on the MNIST, CIFAR-10, and ImageNet datasets. Experimental results show that OD-SGD obtains accuracy similar to, or even slightly better than, SSGD, while training much faster, even exceeding the training speed of ASGD.
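
As a rough illustration of the one-step-delay idea described in the abstract, the following minimal Python/NumPy sketch simulates one worker and a parameter server on a toy least-squares problem. The learning rates, the compensation rule, and names such as lr_global, lr_local, and pending_grad are illustrative assumptions, not the paper's actual update formulas, which additionally involve details (e.g., momentum and compensation coefficients) given only in the full text.

```python
import numpy as np

# Hypothetical, simplified single-process simulation of the one-step-delay
# idea: the parameter server applies a "global" SGD update using a gradient
# that is one step old, while the worker applies a "local" compensation
# update instead of idling at a synchronization barrier. All constants and
# update rules below are illustrative assumptions, not the paper's method.

rng = np.random.default_rng(0)

# Toy least-squares problem: minimize ||X w - y||^2 over w.
X = rng.normal(size=(256, 10))
w_true = rng.normal(size=10)
y = X @ w_true + 0.01 * rng.normal(size=256)

def grad(w, batch):
    xb, yb = X[batch], y[batch]
    return 2.0 * xb.T @ (xb @ w - yb) / len(batch)

lr_global = 0.05   # server-side learning rate (assumed)
lr_local = 0.05    # worker-side compensation learning rate (assumed)

w_server = np.zeros(10)      # global weights on the parameter server
w_worker = w_server.copy()   # worker's (one-step-stale) copy
pending_grad = None          # gradient "in flight" to the server

for step in range(200):
    batch = rng.choice(len(X), size=32, replace=False)
    g = grad(w_worker, batch)

    # Global update: the server applies the gradient pushed in the previous
    # iteration (hence the one-step delay) before serving new weights.
    if pending_grad is not None:
        w_server -= lr_global * pending_grad
    pending_grad = g

    # Local update: rather than waiting for the fresh global weights, the
    # worker compensates the delayed weights with its current gradient.
    w_worker = w_server - lr_local * g

print("final loss:", np.mean((X @ w_worker - y) ** 2))
```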

Updated: 2020-09-30