On the Anatomy of Predictive Models for Accelerating GPU Convolution Kernels and Beyond
ACM Transactions on Architecture and Code Optimization (IF 1.5) Pub Date: 2021-01-08, DOI: 10.1145/3434402
Paolo Sylos Labini, Marco Cianfriglia, Damiano Perri, Osvaldo Gervasi, Grigori Fursin, Anton Lokhmotov, Cedric Nugteren, Bruno Carpentieri, Fabiana Zollo, Flavio Vella

Efficient HPC libraries often expose multiple tunable parameters, alternative algorithmic implementations, or a combination of the two in order to provide optimized routines. The optimal parameters and algorithmic choices may depend on input properties such as the shapes of the matrices involved in the operation. Traditionally, these parameters are tuned by hand or set by auto-tuners. In emerging applications such as deep learning, this approach is not effective across the wide range of inputs and architectures used in practice. In this work, we analyze different machine learning techniques and predictive models to accelerate the convolution operator and GEMM. Moreover, we address the problem of dataset generation, and we study the performance, accuracy, and generalization ability of the models. Our insights allow us to improve the performance of computationally expensive deep learning primitives in three different libraries, on both high-end GPUs and low-power embedded GPU architectures. Experimental results show that simple decision-tree- and MLP-based models improve the target applications by 50% to 300% over auto-tuned and highly optimized vendor heuristics.
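To make the approach concrete, below is a minimal sketch (not the authors' code) of shape-based kernel selection with a decision tree, using scikit-learn. The feature set, the kernel-variant names, and the labeling rule are hypothetical stand-ins for measured GPU kernel timings; the real work benchmarks actual library kernels to produce the training labels.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 2000

# Features describing a convolution: batch size, input channels,
# spatial size, and filter size.
X = np.column_stack([
    rng.integers(1, 65, n),
    rng.integers(3, 513, n),
    rng.integers(7, 225, n),
    rng.choice([1, 3, 5, 7], n),
])

# Hypothetical labeling rule standing in for measured kernel timings:
# pretend a Winograd kernel wins on 3x3 filters with large spatial
# extents, an im2col+GEMM kernel on wide channel counts, and a direct
# kernel otherwise.
def fastest_variant(batch, chans, size, filt):
    if filt == 3 and size >= 56:
        return "winograd"
    if chans >= 128:
        return "im2col_gemm"
    return "direct"

y = np.array([fastest_variant(*row) for row in X])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = DecisionTreeClassifier(max_depth=6).fit(X_tr, y_tr)
print(f"held-out accuracy: {model.score(X_te, y_te):.2f}")

# At run time, the library queries the model once per layer shape and
# dispatches to the predicted kernel variant.
print(model.predict([[32, 256, 56, 3]]))  # expected: ['winograd']

A shallow tree like this is attractive in a dispatch path because a prediction costs only a handful of comparisons, which is negligible next to the kernel launch it selects.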

Updated: 2021-01-08