Datacenter-Scale Analysis and Optimization of GPU Machine Learning Workloads
IEEE Micro (IF 2.8) Pub Date: 2021-08-24, DOI: 10.1109/mm.2021.3097287
Lukasz Wesolowski 1 , Bilge Acun 1 , Valentin Andrei 1 , Adnan Aziz 1 , Gisle Dankel 1 , Christopher Gregg 2 , Xiaoqiao Meng 1 , Cyril Meurillon 1 , Denis Sheahan 1 , Lei Tian 1 , Janet Yang 3 , Peifeng Yu 4 , Kim Hazelwood 1

In this article, we present a system to collectively optimize efficiency in a very large-scale deployment of GPU servers for machine learning workloads at Facebook. Our system 1) measures and stores system-wide efficiency metrics for every executed workflow; 2) aggregates data from across the execution stack to identify optimization opportunities that maximize fleet-wide efficiency improvements; 3) provides periodic and on-demand whole-system profiling for workflows; and 4) automatically analyzes traces for common antipatterns. We present each component of the stack and show case studies demonstrating the use of the tools to significantly improve performance. To our knowledge, our system is the most complete and effective solution for identifying and addressing efficiency problems in datacenter-scale GPU deployments.
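As an illustration of the second component, aggregating per-workflow metrics to surface the optimizations with the largest fleet-wide payoff, the following is a minimal, hypothetical sketch. The metric names, data layout, and the idle-GPU-hours ranking heuristic are illustrative assumptions, not the schema or scoring used by the system described in the paper.

```python
# Hypothetical sketch: rank workflows by recoverable fleet-wide GPU capacity.
# Field names and the scoring heuristic are illustrative, not the paper's schema.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class WorkflowMetrics:
    name: str                    # workflow identifier
    gpu_hours: float             # total GPU-hours consumed across the fleet
    avg_gpu_utilization: float   # mean GPU utilization in [0, 1]


def rank_by_potential_savings(workflows: List[WorkflowMetrics]) -> List[Tuple[str, float]]:
    """Order workflows by idle GPU-hours, i.e. where optimization effort
    would recover the most fleet-wide capacity."""
    scored = [
        (w.name, w.gpu_hours * (1.0 - w.avg_gpu_utilization))
        for w in workflows
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    fleet = [
        WorkflowMetrics("ads_ranking_train", gpu_hours=120_000, avg_gpu_utilization=0.55),
        WorkflowMetrics("recsys_feed_train", gpu_hours=40_000, avg_gpu_utilization=0.30),
        WorkflowMetrics("vision_pretrain", gpu_hours=15_000, avg_gpu_utilization=0.85),
    ]
    for name, idle_hours in rank_by_potential_savings(fleet):
        print(f"{name}: ~{idle_hours:,.0f} idle GPU-hours")
```

Ranking by GPU-hours weighted by the unutilized fraction captures the fleet-wide framing in the abstract: a moderately inefficient workflow running at very large scale can offer more total savings than a badly inefficient but small one.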

Updated: 2021-08-24