A mechanism for balancing accuracy and scope in cross-machine black-box GPU performance modeling
The International Journal of High Performance Computing Applications (IF 3.1). Pub Date: 2020-06-03. DOI: 10.1177/1094342020921340
James D. Stevens, Andreas Klöckner
The ability to model, analyze, and predict execution time of computations is an important building block that supports numerous efforts, such as load balancing, benchmarking, job scheduling, developer-guided performance optimization, and the automation of performance tuning for high performance, parallel applications. In today’s increasingly heterogeneous computing environment, this task must be accomplished efficiently across multiple architectures, including massively parallel coprocessors like GPUs, which are increasingly prevalent in the world’s fastest supercomputers. To address this challenge, we present an approach for constructing customizable, cross-machine performance models for GPU kernels, including a mechanism to automatically and symbolically gather performance-relevant kernel operation counts, a tool for formulating mathematical models using these counts, and a customizable parameterized collection of benchmark kernels used to calibrate models to GPUs in a black-box fashion. With this approach, we empower the user to manage trade-offs between model accuracy, evaluation speed, and generalizability. A user can define their own model and customize the calibration process, making it as simple or complex as desired, and as application-targeted or general as desired. As application examples of our approach, we demonstrate both linear and nonlinear models; these examples are designed to predict execution times for multiple variants of a particular computation: two matrix-matrix multiplication variants, four discontinuous Galerkin differentiation operation variants, and two 2D five-point finite difference stencil variants. For each variant, we present accuracy results on GPUs from multiple vendors and hardware generations. 
We view this highly user-customizable approach as a response to a central question in GPU performance modeling: how can we model GPU performance in a cost-explanatory fashion while maintaining accuracy, evaluation speed, portability, and ease of use? The last of these, we believe, precludes approaches that require manual collection of kernel or hardware statistics.
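The abstract does not give the model form, but the general idea of a linear model calibrated in a black-box fashion can be sketched as follows: predicted time is a weighted sum of per-kernel operation counts, with the weights (per-operation costs) fit by least squares against measured benchmark timings on the target GPU. All counts and timings below are illustrative placeholders, not data from the paper.

```python
import numpy as np

# Hypothetical operation counts for four calibration benchmark kernels.
# Columns: [fp32 flops, global-load bytes, global-store bytes].
counts = np.array([
    [1.0e9, 4.0e8, 1.0e8],
    [2.0e9, 2.0e8, 2.0e8],
    [5.0e8, 8.0e8, 1.0e8],
    [3.0e9, 6.0e8, 3.0e8],
])

# Measured wall-clock times (seconds) for those kernels on one GPU.
times = np.array([0.010, 0.014, 0.0115, 0.025])

# Calibration: least-squares fit of per-operation costs.
coeffs, *_ = np.linalg.lstsq(counts, times, rcond=None)

# Prediction: apply the fitted costs to a new kernel's operation counts
# (in the paper's approach, such counts are gathered automatically and
# symbolically from the kernel itself).
new_counts = np.array([1.5e9, 5.0e8, 1.5e8])
predicted = new_counts @ coeffs
print(f"predicted time: {predicted:.4f} s")
```

Refitting `coeffs` on a different GPU re-calibrates the same model to that machine, which is what makes the scheme cross-machine; a nonlinear model would replace the weighted sum with a more elaborate function of the counts while keeping the same calibrate-then-predict structure.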
