PyTorch Distributed: Experiences on Accelerating Data Parallel Training
arXiv - CS - Distributed, Parallel, and Cluster Computing. Pub Date: 2020-06-28. arXiv: 2006.15704
Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, Soumith Chintala

This paper presents the design, implementation, and evaluation of the PyTorch distributed data parallel module. PyTorch is a widely-adopted scientific computing package used in deep learning research and applications. Recent advances in deep learning argue for the value of large datasets and large models, which necessitates the ability to scale out model training to more computational resources. Data parallelism has emerged as a popular solution for distributed training thanks to its straightforward principle and broad applicability. In general, the technique of distributed data parallelism replicates the model on every computational resource to generate gradients independently and then communicates those gradients at each iteration to keep model replicas consistent. Despite the conceptual simplicity of the technique, the subtle dependencies between computation and communication make it non-trivial to optimize the distributed training efficiency. As of v1.5, PyTorch natively provides several techniques to accelerate distributed data parallel, including bucketing gradients, overlapping computation with communication, and skipping gradient synchronization. Evaluations show that, when configured appropriately, the PyTorch distributed data parallel module attains near-linear scalability using 256 GPUs.
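The abstract names three acceleration techniques exposed by PyTorch's DistributedDataParallel (DDP) module: gradient bucketing, overlapping communication with backward computation, and skipping gradient synchronization. Below is a minimal single-process-per-GPU sketch of how these surface in the public API, assuming the script is launched by torchrun (which sets LOCAL_RANK); the model, data, and hyperparameters are placeholders, not from the paper.

```python
import os
import contextlib
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # NCCL backend for GPU collectives; rank/world size come from torchrun's env vars.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Linear(1024, 1024).cuda(local_rank)   # placeholder model
    # bucket_cap_mb sets the gradient bucket size; DDP AllReduces each bucket as soon
    # as its gradients are ready, overlapping communication with the rest of backward.
    ddp_model = DDP(model, device_ids=[local_rank], bucket_cap_mb=25)
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    accumulation_steps = 4                           # placeholder accumulation schedule
    for step in range(100):
        inputs = torch.randn(32, 1024, device=local_rank)   # placeholder data
        targets = torch.randn(32, 1024, device=local_rank)

        # Skip gradient synchronization on intermediate accumulation steps via no_sync();
        # only the final backward of each window triggers the AllReduce.
        sync_this_step = (step + 1) % accumulation_steps == 0
        ctx = contextlib.nullcontext() if sync_this_step else ddp_model.no_sync()
        with ctx:
            loss = loss_fn(ddp_model(inputs), targets)
            loss.backward()
        if sync_this_step:
            optimizer.step()
            optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

A script like this would typically be launched with one process per GPU, e.g. `torchrun --nproc_per_node=8 train.py`; the exact bucket size and accumulation window are tuning knobs, and the paper's evaluation discusses how such configuration choices affect scalability.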

Updated: 2020-06-30