Characterizing and Modeling Distributed Training with Transient Cloud GPU Servers
arXiv - CS - Performance. Pub Date: 2020-04-07, DOI: arxiv-2004.03072
Shijian Li and Robert J. Walls and Tian Guo

Cloud GPU servers have become the de facto way for deep learning practitioners to train complex models on large-scale datasets. However, it is challenging to determine the appropriate cluster configuration---e.g., server type and number---for different training workloads while balancing the trade-offs in training time, cost, and model accuracy. Adding to the complexity is the potential to reduce the monetary cost by using cheaper, but revocable, transient GPU servers. In this work, we analyze distributed training performance under diverse cluster configurations using CM-DARE, a cloud-based measurement and training framework. Our empirical datasets include measurements from three GPU types, six geographic regions, twenty convolutional neural networks, and thousands of Google Cloud servers. We also demonstrate the feasibility of predicting training speed and overhead using regression-based models. Finally, we discuss potential use cases of our performance modeling such as detecting and mitigating performance bottlenecks.
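
The abstract states that training speed and overhead can be predicted with regression-based models. As a rough sketch of that idea (not the paper's CM-DARE implementation), one could fit a linear regression over cluster-configuration features; every feature choice and number below is an illustrative assumption:

    # Minimal sketch: regress training throughput on cluster configuration.
    # Features and measurements are made up for illustration; the paper's
    # actual feature set and model form are not given in this abstract.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Hypothetical rows: [num_workers, gpu_peak_tflops, model_gflops_per_image]
    X = np.array([
        [2,  9.3, 3.9],   # e.g., 2 K80-class workers training a ResNet-style CNN
        [4,  9.3, 3.9],
        [4, 15.7, 3.9],   # e.g., 4 V100-class workers
        [8, 15.7, 3.9],
    ])
    y = np.array([210.0, 400.0, 980.0, 1850.0])  # observed images/sec (illustrative)

    model = LinearRegression().fit(X, y)

    # Predict throughput for an unseen configuration: 6 V100-class workers.
    print(model.predict(np.array([[6, 15.7, 3.9]])))

A model of this kind would let a practitioner estimate training speed, and hence time and cost, for candidate cluster configurations before launching them, which is the trade-off the abstract describes.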

Updated: 2020-04-08