Whale: A Unified Distributed Training Framework
arXiv - CS - Distributed, Parallel, and Cluster Computing. Pub Date: 2020-11-18. DOI: arxiv-2011.09208
Ang Wang, Xianyan Jia, Le Jiang, Jie Zhang, Yong Li, Wei Lin

Data parallelism (DP) has long been the common practice for speeding up training workloads. However, as data and model sizes grow, DP becomes less efficient for most distributed training workloads, and it does not work at all for models whose parameters cannot fit into a single GPU's device memory. To enable and further optimize industrial-scale giant model training, we present Whale, a unified distributed training framework. It provides a comprehensive set of parallel strategies, including data parallelism, model parallelism, operator sharding, pipeline parallelism, hybrid strategies, and automatic parallelization. To express complex training strategies effectively and efficiently within one framework, Whale IR is designed as the basic unit for exploring and implementing different distributed strategies. Moreover, Whale enables automatic parallelism through a meta-driven cost model. Whale is compatible with TensorFlow and can distribute training tasks by adding a few lines of code, without changing the user's model code. To the best of our knowledge, Whale is the first framework that supports various hybrid distributed strategies within a single system. In our experiments on the BERT-Large model, Whale's pipeline strategy is 2.32 times faster than Horovod data parallelism (HDP) on 64 GPUs. On a large-scale image classification task (100,000 classes), Whale's hybrid strategy, which combines operator sharding and DP, is 14.8 times faster than HDP on 64 GPUs.
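For context, the sketch below shows the kind of Horovod data-parallel (HDP) setup the abstract uses as its baseline: each process owns one GPU, gradients are all-reduced across workers, and initial weights are broadcast from rank 0. It uses Horovod's public Keras API with a toy model and synthetic data as placeholders (my own assumptions, not workloads from the paper); Whale's own annotation API is not reproduced here because the abstract does not show it.

```python
# Minimal Horovod data-parallel (HDP) baseline sketch; the tiny model and
# synthetic data are placeholders, not the workloads from the paper.
import numpy as np
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # one training process per GPU

# Pin this process to its local GPU.
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

# Toy model and data, standing in for the user's unchanged model code.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(32,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
x = np.random.rand(1024, 32).astype("float32")
y = np.random.randint(0, 10, size=(1024,))

# Scale the learning rate by the number of workers and wrap the optimizer
# so gradients are all-reduced across GPUs each step.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(1e-3 * hvd.size()))
model.compile(optimizer=opt, loss="sparse_categorical_crossentropy")

# Broadcast initial variables from rank 0 so all replicas start identical.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
model.fit(x, y, batch_size=64, epochs=1, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)
```

Launched with, e.g., `horovodrun -np 64 python train.py`, this replicates the full model on every GPU, which is exactly the limitation the abstract highlights for models that exceed a single GPU's memory; Whale's hybrid strategies (operator sharding combined with DP) target that case.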

Updated: 2020-11-19