Throughput Prediction of Asynchronous SGD in TensorFlow
arXiv - CS - Performance. Pub Date: 2019-11-12. DOI: arxiv-1911.04650
Zhuojin Li, Wumo Yan, Marco Paolieri, Leana Golubchik

Modern machine learning frameworks can train neural networks using multiple nodes in parallel, each computing parameter updates with stochastic gradient descent (SGD) and sharing them asynchronously through a central parameter server. Due to communication overhead and bottlenecks, the total throughput of SGD updates in a cluster scales sublinearly, saturating as the number of nodes increases. In this paper, we present an approach for predicting training throughput from profiling traces collected on a single-node configuration. Our approach models the interaction of multiple nodes and the scheduling of concurrent transmissions between the parameter server and each node. By accounting for the dependencies between received parts and pending computations, we predict overlaps between computation and communication and generate synthetic execution traces for multi-node configurations. We validate our approach on TensorFlow training jobs for popular image classification neural networks, on AWS and on our in-house cluster, using nodes equipped with GPUs or only with CPUs. We also investigate the effects of the data transmission policies used in TensorFlow and the accuracy of our approach when combined with optimizations of the transmission schedule.
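The abstract describes the approach only at a high level. As a rough illustration of the kind of model it suggests, the sketch below simulates workers whose gradient computations run in parallel but whose parameter exchanges contend for a single shared link at the parameter server. The function simulate_throughput and its parameters compute_time and comm_time are hypothetical stand-ins for values that would be extracted from single-node profiling traces; the paper's actual model is richer, capturing per-tensor transmissions and computation/communication overlap, which this sketch deliberately omits.

```python
import heapq

def simulate_throughput(num_nodes, compute_time, comm_time, steps_per_node=500):
    """Event-driven sketch: workers compute gradients in parallel, but their
    parameter exchanges are serialized on the parameter server's link.
    compute_time and comm_time would come from single-node profiling traces."""
    # Each entry is (time at which a node finishes its compute step, node id).
    events = [(compute_time, n) for n in range(num_nodes)]
    heapq.heapify(events)
    ps_free = 0.0  # time at which the parameter-server link is next idle
    updates, makespan = 0, 0.0
    total_steps = steps_per_node * num_nodes
    while updates < total_steps:
        t, node = heapq.heappop(events)
        start = max(t, ps_free)        # wait if the link is busy
        ps_free = start + comm_time    # occupy the link for this update
        updates += 1
        makespan = ps_free
        # The node starts its next compute step after the exchange completes.
        heapq.heappush(events, (ps_free + compute_time, node))
    return updates / makespan          # aggregate updates per unit time

for n in (1, 2, 4, 8, 16):
    print(n, round(simulate_throughput(n, compute_time=1.0, comm_time=0.2), 2))
```

With compute_time=1.0 and comm_time=0.2, throughput grows roughly linearly for small clusters and then saturates near 1/comm_time = 5 updates per unit time, reproducing the sublinear scaling and saturation that the abstract describes.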

Updated: 2020-03-02