Using hardware performance counters to speed up autotuning convergence on GPUs
arXiv - CS - Performance. Pub Date: 2021-02-10. DOI: arxiv-2102.05297. Jiří Filipovič, Jana Hozzová, Amin Nezarat, Jaroslav Oľha, Filip Petrovič
Nowadays, GPU accelerators are commonly used to speed up general-purpose
computing tasks on a variety of hardware. However, due to the diversity of GPU
architectures and processed data, optimization of codes for a particular type
of hardware and specific data characteristics can be extremely challenging. The
autotuning of performance-relevant source-code parameters allows for automatic
optimization of applications and keeps their performance portable. Although the
autotuning process typically results in code speed-up, searching the tuning
space can bring unacceptable overhead if (i) the tuning space is vast and full
of poorly performing implementations, or (ii) the autotuning process has to be
repeated frequently because of changes in processed data or migration to
different hardware. In this paper, we introduce a novel method for searching tuning spaces. The
method takes advantage of collecting hardware performance counters (also known
as profiling counters) during empirical tuning. Those counters are used to
navigate the searching process towards faster implementations. The method
requires the tuning space to be sampled on any GPU. It builds a
problem-specific model, which can be used during autotuning on various, even
previously unseen inputs or GPUs. Using a set of five benchmarks, we
experimentally demonstrate that our method can speed up autotuning when an
application needs to be ported to different hardware or when it needs to
process data with different characteristics. We also compare our method to the
state of the art and show that it is superior in terms of the number of search
steps and typically outperforms other searches in terms of convergence time.
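The idea described in the abstract can be sketched in a few lines: profile a small sample of the tuning space, use the collected counters to estimate which configurations are closest to a well-performing bottleneck profile, and spend the remaining measurement budget there. The sketch below is a minimal illustration, not the paper's implementation: the tuning space, the synthetic cost function inside run_and_profile, and the hand-made bottleneck_score model are all hypothetical stand-ins (a real autotuner would compile and run kernel variants and read hardware counters, e.g. via NVIDIA's CUPTI API).

```python
import random

# Hypothetical toy tuning space: each configuration is (block_size, unroll).
# These stand in for the performance-relevant source-code parameters the
# abstract describes; all numbers below are synthetic, not from the paper.
TUNING_SPACE = [(bs, uf) for bs in (32, 64, 128, 256) for uf in (1, 2, 4, 8)]

def run_and_profile(config):
    """Simulated empirical tuning step: returns (runtime, counters).
    A real autotuner would compile and run the kernel variant and read
    hardware performance counters from the GPU."""
    bs, uf = config
    runtime = abs(bs - 128) / 64.0 + abs(uf - 4) / 2.0 + 0.1  # toy cost
    counters = {
        "dram_utilization": 1.0 / (1.0 + abs(bs - 128) / 32.0),
        "instruction_replays": abs(uf - 4) * 10.0,
    }
    return runtime, counters

def bottleneck_score(counters):
    """Hand-made stand-in for a problem-specific model: maps a counter
    vector to a 'distance from a well-performing profile'; lower is better."""
    return (1.0 - counters["dram_utilization"]) \
        + counters["instruction_replays"] / 40.0

def guided_search(budget=4, pilot=3, seed=1):
    """Profile a small pilot sample, then spend the remaining budget on
    configurations closest to the least-bottlenecked pilot measurement."""
    rng = random.Random(seed)
    pilot_cfgs = rng.sample(TUNING_SPACE, pilot)
    measured = {cfg: run_and_profile(cfg)[0] for cfg in pilot_cfgs}
    # Counters, not runtimes, steer the search: pick the pilot config whose
    # counter profile looks least bottlenecked and explore its neighbours.
    best_pilot = min(pilot_cfgs,
                     key=lambda c: bottleneck_score(run_and_profile(c)[1]))
    rest = [c for c in TUNING_SPACE if c not in measured]
    rest.sort(key=lambda c: abs(c[0] - best_pilot[0]) + abs(c[1] - best_pilot[1]))
    for cfg in rest[:budget]:
        measured[cfg] = run_and_profile(cfg)[0]
    return min(measured, key=measured.get)
```

In this toy setup the counter-derived score is correlated with runtime by construction, so ranking by bottleneck_score points the search toward the fastest configuration after only a handful of empirical measurements; the paper's contribution is learning such a model from profiled samples so that it transfers to unseen inputs and GPUs.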
Updated: 2021-02-11