GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
arXiv - CS - Computation and Language. Pub Date: 2020-06-30, DOI: arxiv-2006.16668
Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, Zhifeng Chen

Neural network scaling has been critical for improving the model quality in many real-world machine learning applications with vast amounts of training data and compute. Although this trend of scaling is affirmed to be a sure-fire approach for better model quality, there are challenges on the path such as the computation cost, ease of programming, and efficient implementation on parallel devices. GShard is a module composed of a set of lightweight annotation APIs and an extension to the XLA compiler. It provides an elegant way to express a wide range of parallel computation patterns with minimal changes to the existing model code. GShard enabled us to scale up multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters using automatic sharding. We demonstrate that such a giant model can efficiently be trained on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art.
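To make the "annotate tensors, let the compiler shard the computation" idea concrete, below is a minimal sketch using present-day JAX/XLA sharding APIs as a stand-in for GShard's original TensorFlow annotation API (which this page does not show). The mesh axis name, toy dimensions, and the top-1 gating are illustrative assumptions; the paper's MoE layers use a more elaborate gating scheme with expert capacity, and the real models ran on thousands of TPU cores rather than the local devices used here.

```python
# Sketch: express expert-parallel sharding with lightweight annotations and
# keep the model code itself as ordinary dense-array code. Illustrative only;
# not GShard's actual TensorFlow/XLA annotation API.
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# A 1-D logical device mesh; the paper scaled this idea to 2048 TPU v3 cores.
devices = np.asarray(jax.devices())
mesh = Mesh(devices, ("expert",))

num_experts, d_model, d_ff = len(devices), 8, 32
tokens = jnp.ones((16, d_model))

# Gating weights stay replicated; expert weights are annotated so the first
# (expert) axis is split across the mesh and each device holds its own experts.
w_gate = jnp.zeros((d_model, num_experts))
experts = jax.device_put(
    jnp.zeros((num_experts, d_model, d_ff)),
    NamedSharding(mesh, P("expert", None, None)),
)

@jax.jit
def moe_ffn(tokens, w_gate, experts):
    # Toy top-1 gating: route each token to one expert, then apply that
    # expert's feed-forward matrix. The compiler inserts whatever cross-device
    # communication the sharding annotations imply.
    gate_logits = tokens @ w_gate                    # [tokens, num_experts]
    choice = jnp.argmax(gate_logits, axis=-1)        # [tokens]
    picked = jnp.take(experts, choice, axis=0)       # [tokens, d_model, d_ff]
    return jnp.einsum("td,tdf->tf", tokens, picked)  # [tokens, d_ff]

print(moe_ffn(tokens, w_gate, experts).shape)        # (16, 32)
```

The point of the sketch is that the forward pass reads like single-device code; only the placement annotations change when the model is scaled across more devices.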

Updated: 2020-07-01