Time-Based Roofline for Deep Learning Performance Analysis,arXiv - CS - Hardware Architecture



arXiv - CS - Hardware Architecture, Pub Date: 2020-09-09, DOI: arxiv-2009.04598
Yunsong Wang, Charlene Yang, Steven Farrell, Yan Zhang, Thorsten Kurth, Samuel Williams

Deep learning applications are usually very compute-intensive and require a long run time for training and inference. This has been tackled by researchers from both hardware and software sides, and in this paper, we propose a Roofline-based approach to performance analysis to facilitate the optimization of these applications. This approach is an extension of the Roofline model widely used in traditional high-performance computing applications, and it incorporates both compute/bandwidth complexity and run time in its formulae to provide insights into deep learning-specific characteristics. We take two sets of representative kernels, 2D convolution and long short-term memory, to validate and demonstrate the use of this new approach, and investigate how arithmetic intensity, cache locality, auto-tuning, kernel launch overhead, and Tensor Core usage can affect performance. Compared to the common ad-hoc approach, this study helps form a more systematic way to analyze code performance and identify optimization opportunities for deep learning applications.
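The abstract does not give the extended formulae, so the following is only a minimal sketch of the classic Roofline bound restated in terms of time, not the paper's own formulation. It assumes a kernel's run time is bounded below by the larger of its compute time and its data-movement time. The function names and all numeric values are hypothetical and chosen for illustration.

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte moved (classic Roofline x-axis)."""
    return flops / bytes_moved


def roofline_time_bound(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Lower bound on run time in seconds under the classic Roofline assumptions.

    Assumption (not taken from the paper): compute and data movement overlap
    perfectly, so the slower of the two sets the bound.
    """
    compute_time = flops / peak_flops            # seconds if purely compute-bound
    memory_time = bytes_moved / peak_bandwidth   # seconds if purely bandwidth-bound
    bound = max(compute_time, memory_time)
    limiter = "compute" if compute_time >= memory_time else "memory"
    return bound, limiter


# Hypothetical kernel and machine numbers, for illustration only.
kernel_flops = 2e9        # FLOPs executed by the kernel
kernel_bytes = 1e9        # bytes moved to and from memory
machine_peak = 10e12      # peak FLOP/s
machine_bw = 1e12         # peak bytes/s

bound, limiter = roofline_time_bound(kernel_flops, kernel_bytes, machine_peak, machine_bw)
print(f"AI = {arithmetic_intensity(kernel_flops, kernel_bytes):.1f} FLOP/byte")
print(f"time bound = {bound:.2e} s, limited by {limiter}")
```

Comparing a measured run time against such a bound shows how far a kernel sits from the machine limit. Effects the abstract lists, such as kernel launch overhead, would show up as a gap that this simple bound does not capture.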


Updated: 2020-09-24
