Maximizing Parallelism in Distributed Training for Huge Neural Networks
arXiv - CS - Performance. Pub Date: 2021-05-30, DOI: arxiv-2105.14450
Zhengda Bian, Qifan Xu, Boxiang Wang, Yang You

Recent Natural Language Processing techniques have been refreshing state-of-the-art performance at an incredible speed. Training huge language models has therefore become an imperative demand in both industry and academia. However, huge language models impose challenges on both hardware and software. Graphics processing units (GPUs) are iterated frequently to meet the exploding demand, and a variety of ASICs such as TPUs have been spawned. However, there is still a tension between the fast growth of extremely huge models and the fact that Moore's law is approaching its end. To this end, many model parallelism techniques have been proposed to distribute the model parameters across multiple devices, so as to alleviate the pressure on both memory and computation. Our work is the first to introduce 3-dimensional model parallelism for expediting huge language models. By reaching a perfect load balance, our approach achieves lower memory and communication costs than existing state-of-the-art 1-D and 2-D model parallelism. Our experiments on 64 V100 GPUs at TACC show that our 3-D parallelism outperforms 1-D and 2-D parallelism with 2.32x and 1.57x speedups, respectively.
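
For intuition, the sketch below simulates the 3-D (cube) block layout for matrix multiplication that this style of 3-D model parallelism builds on: a weight matrix and an input matrix are split into blocks across a q x q x q device mesh, each simulated device multiplies only its local blocks, and the partial products are summed. The mesh size q, the matrix shapes, and the helper names here are illustrative assumptions for a single-process simulation, not the paper's actual distributed implementation.

# Minimal NumPy simulation of a 3-D block layout for C = A @ B.
# The q**3 "devices" are simulated as loop indices (i, j, l); shapes and names
# are assumptions for illustration only.
import numpy as np

q = 2                      # devices per mesh dimension -> q**3 = 8 simulated devices
m, k, n = 8, 8, 8          # global matrix sizes (assumed divisible by q)

rng = np.random.default_rng(0)
A = rng.standard_normal((m, k))   # e.g. activation matrix
B = rng.standard_normal((k, n))   # e.g. weight matrix

def block(M, i, j, q):
    # Return the (i, j) block of M when M is split into a q x q grid of blocks.
    r, c = M.shape[0] // q, M.shape[1] // q
    return M[i * r:(i + 1) * r, j * c:(j + 1) * c]

# Device (i, j, l) holds only A's (i, l) block and B's (l, j) block, computes a
# local partial product, and the partials are reduce-summed over l.
C_blocks = [[np.zeros((m // q, n // q)) for _ in range(q)] for _ in range(q)]
for i in range(q):
    for j in range(q):
        for l in range(q):
            C_blocks[i][j] += block(A, i, l, q) @ block(B, l, j, q)

C = np.block(C_blocks)
print("3-D-layout simulation matches the dense matmul:", np.allclose(C, A @ B))

In a real distributed run, each (i, j, l) block would live on a separate GPU and the sum over l would be an all-reduce along one mesh axis, which is what keeps the per-device memory and communication cost low.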

Updated: 2021-06-01