Using hardware performance counters to speed up autotuning convergence on GPUs
arXiv - CS - Performance Pub Date : 2021-02-10 , DOI: arxiv-2102.05297
Jiří Filipovič, Jana Hozzová, Amin Nezarat, Jaroslav Oľha, Filip Petrovič

Nowadays, GPU accelerators are commonly used to speed up general-purpose computing tasks on a variety of hardware. However, due to the diversity of GPU architectures and processed data, optimizing code for a particular type of hardware and specific data characteristics can be extremely challenging. Autotuning of performance-relevant source-code parameters allows for automatic optimization of applications and keeps their performance portable. Although the autotuning process typically results in code speed-up, searching the tuning space can bring unacceptable overhead if (i) the tuning space is vast and full of poorly performing implementations, or (ii) the autotuning process has to be repeated frequently because of changes in the processed data or migration to different hardware. In this paper, we introduce a novel method for searching tuning spaces. The method takes advantage of hardware performance counters (also known as profiling counters) collected during empirical tuning. Those counters are used to navigate the search towards faster implementations. The method requires the tuning space to be sampled on any GPU. It builds a problem-specific model, which can be used during autotuning on various, even previously unseen, inputs or GPUs. Using a set of five benchmarks, we experimentally demonstrate that our method can speed up autotuning when an application needs to be ported to different hardware or when it needs to process data with different characteristics. We also compare our method to the state of the art and show that it is superior in terms of the number of search steps and typically outperforms other searches in terms of convergence time.
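The core idea of counter-guided search can be illustrated with a small, self-contained sketch. The snippet below is a hypothetical, CPU-only mock, not the authors' implementation: the tuning space, the `measure` function (which fabricates a synthetic runtime and two mock profiling counters), and the ranking heuristic are all illustrative assumptions standing in for real GPU runs and the paper's learned model.

```python
# Hedged sketch of counter-guided tuning-space search. All names and the
# synthetic cost model are illustrative; a real setup would launch GPU
# kernels and read hardware performance counters via a profiler.
import itertools
import random

# Toy tuning space: Cartesian product of a few kernel parameters.
TUNING_SPACE = [
    {"block_size": b, "unroll": u, "vector_width": v}
    for b, u, v in itertools.product([64, 128, 256], [1, 2, 4], [1, 2, 4])
]

def measure(config):
    """Stand-in for an empirical GPU run: returns (runtime, counters).

    The cost and counters are fabricated so the example runs anywhere;
    the optimum of this toy model is block_size=128, unroll=2,
    vector_width=2."""
    cost = (abs(config["block_size"] - 128) / 64
            + abs(config["unroll"] - 2)
            + abs(config["vector_width"] - 2))
    counters = {
        "dram_transactions": 1000 + 50 * abs(config["block_size"] - 128),
        "achieved_occupancy": max(0.1, 1.0 - 0.1 * config["unroll"]),
    }
    return cost, counters

def counter_guided_search(space, budget=6, seed=0):
    """Greedy search: counters from the best measured configuration so
    far steer which parameter to vary next (a crude proxy for the
    problem-specific model described in the paper)."""
    rng = random.Random(seed)
    remaining = list(space)
    rng.shuffle(remaining)
    best_cfg, best_time = None, float("inf")
    history = []  # (config, runtime, counters) of every empirical run
    for _ in range(budget):
        if history:
            ref_cfg, _, ref_counters = min(history, key=lambda h: h[1])
            # Counter-derived hint: if memory traffic dominated the best
            # run, explore block_size next; otherwise explore unroll.
            focus = ("block_size"
                     if ref_counters["dram_transactions"] > 1200
                     else "unroll")
            # Prefer candidates that differ in the focused parameter but
            # otherwise stay close to the best configuration seen so far.
            remaining.sort(key=lambda c: (c[focus] == ref_cfg[focus],
                                          sum(c[k] != ref_cfg[k]
                                              for k in c)))
        cfg = remaining.pop(0)
        runtime, counters = measure(cfg)
        history.append((cfg, runtime, counters))
        if runtime < best_time:
            best_cfg, best_time = cfg, runtime
    return best_cfg, best_time

best_cfg, best_time = counter_guided_search(TUNING_SPACE)
print(best_cfg, best_time)
```

The design point the sketch tries to capture is that each empirical measurement yields more than a runtime: the counters explain *why* a configuration is slow, so the search can skip whole regions of the space instead of ranking candidates by runtime alone.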

Updated: 2021-02-11