Communication Profiling and Characterization of Deep Learning Workloads on Clusters with High-Performance Interconnects
IEEE Micro (IF 2.8), Pub Date: 2020-01-01, DOI: 10.1109/mm.2019.2949986
Ammar Ahmad Awan, Arpan Jain, Ching-Hsiang Chu, Hari Subramoni, Dhabaleswar K. Panda

Heterogeneous high-performance computing systems with GPUs are equipped with high-performance interconnects such as InfiniBand, Omni-Path, PCIe, and NVLink. However, little exists in the literature that captures the performance impact of these interconnects on distributed deep learning (DL). In this article, we choose Horovod, a distributed training middleware, to analyze and profile various DNN training workloads using TensorFlow and PyTorch, in addition to standard MPI microbenchmarks. We use a wide variety of systems with CPUs such as Intel Xeon and IBM POWER9, GPUs such as Volta V100, and various interconnects to analyze the following metrics: 1) message size with Horovod's tensor fusion; 2) message size without tensor fusion; 3) number of MPI/NCCL calls; and 4) time taken by each MPI/NCCL call. We observed extreme performance variations for non-power-of-two message sizes on different platforms. To address this, we design a message-padding scheme for Horovod, demonstrate significantly smoother allreduce latency profiles, and report cases where we observed improvements in end-to-end training.
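To make the kind of measurement behind metric 4 concrete, below is a minimal OSU-style allreduce latency microbenchmark written with mpi4py and NumPy. This is an illustrative sketch, not the harness used in the paper: the time_allreduce helper, the iteration counts, and the specific byte sizes swept are our own assumptions. Comparing a power-of-two byte count with a slightly larger non-power-of-two one is enough to expose the latency variation the abstract describes.

import numpy as np
from mpi4py import MPI

def time_allreduce(comm, nbytes, iters=100, warmup=10):
    # Average latency (microseconds) of MPI_Allreduce on a
    # float32 buffer of roughly `nbytes` bytes.
    send = np.ones(nbytes // 4, dtype=np.float32)
    recv = np.empty_like(send)
    for _ in range(warmup):
        comm.Allreduce(send, recv, op=MPI.SUM)
    comm.Barrier()
    start = MPI.Wtime()
    for _ in range(iters):
        comm.Allreduce(send, recv, op=MPI.SUM)
    return (MPI.Wtime() - start) / iters * 1e6

if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    # Compare power-of-two sizes with nearby non-power-of-two sizes.
    for nbytes in (1 << 20, (1 << 20) + 4, 1 << 22, (1 << 22) + 4):
        lat = time_allreduce(comm, nbytes)
        if comm.Get_rank() == 0:
            print(f"{nbytes} bytes: {lat:.1f} us")

Run with, for example, mpirun -np 4 python allreduce_bench.py; on interconnects whose collectives favor power-of-two payloads, the "+4" sizes can show disproportionately higher latency.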

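The message-padding idea itself can be sketched at the application level, again with mpi4py: zero-pad each buffer to the next power-of-two element count before the allreduce, then drop the padding afterwards. The helper names (next_pow2, padded_allreduce) are our own illustrations, not Horovod's API, and the scheme described in the paper is integrated into Horovod's internals rather than applied around each call as shown here.

import numpy as np
from mpi4py import MPI

def next_pow2(n):
    # Smallest power of two >= n.
    return 1 if n <= 1 else 1 << (n - 1).bit_length()

def padded_allreduce(comm, buf):
    # Allreduce a 1-D buffer, zero-padded to a power-of-two element
    # count so the underlying collective sees a "friendly" size.
    n = buf.size
    padded = np.zeros(next_pow2(n), dtype=buf.dtype)
    padded[:n] = buf
    out = np.empty_like(padded)
    comm.Allreduce(padded, out, op=MPI.SUM)
    return out[:n]  # zero padding does not change the sums

if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    grad = np.full(3000, comm.Get_rank(), dtype=np.float32)  # 3000 is not a power of two
    reduced = padded_allreduce(comm, grad)
    if comm.Get_rank() == 0:
        print(reduced[:4])  # each element equals the sum of all ranks

The extra bandwidth cost of the padding is bounded (at most 2x the payload, and far less near a power of two), which is the trade-off that makes smoother latency profiles a net win in the reported cases.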