Flynn’s Reconciliation
ACM Transactions on Architecture and Code Optimization (IF 1.6) Pub Date: 2021-06-08, DOI: 10.1145/3458357
Daniel Thuerck, Nicolas Weber, Roberto Bifulco

A large portion of the recent performance increase in the High Performance Computing (HPC) and Machine Learning (ML) domains is fueled by accelerator cards. Many popular ML frameworks support accelerators by organizing computations as a computational graph over a set of highly optimized, batched general-purpose kernels. While this approach simplifies the kernels’ implementation for each individual accelerator, the increasing heterogeneity among accelerator architectures for HPC complicates the creation of portable and extensible libraries of such kernels. Therefore, using a generalization of the CUDA community’s warp register cache programming idiom, we propose a new programming idiom (CoRe) and a virtual architecture model (PIRCH), abstracting over SIMD and SIMT paradigms. We define and automate the mapping process from a single source to PIRCH’s intermediate representation and develop backends that issue code for three different architectures: Intel AVX512, NVIDIA GPUs, and NEC SX-Aurora. Code generated by our source-to-source compiler for batched kernels, borG, competes favorably with vendor-tuned libraries and is up to 2× faster than hand-tuned kernels across architectures.
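The abstract builds on the CUDA community's "warp register cache" idiom, in which the 32 threads of a warp keep a small working set distributed across their registers and exchange values with warp shuffles instead of shared memory. The following minimal sketch illustrates that idiom only; the kernel name, data layout, and operation are hypothetical and not taken from the paper:

```cuda
#include <cuda_runtime.h>

// Illustrative sketch of the warp register cache idiom (not borG output):
// each lane of a warp caches one element of a 32-wide row in a register,
// and reads its left neighbor's register via __shfl_sync rather than
// staging the row through shared memory.
__global__ void row_shift_sum(const float *in, float *out, int rows) {
    int lane = threadIdx.x & 31;                          // lane id within the warp
    int row  = blockIdx.x * (blockDim.x >> 5) + (threadIdx.x >> 5);
    if (row >= rows) return;

    // Load one element per lane into a register: the "register cache".
    float v = in[row * 32 + lane];

    // Fetch the left neighbor's cached value with a warp shuffle.
    float left = __shfl_sync(0xffffffffu, v, (lane + 31) & 31);

    out[row * 32 + lane] = v + left;
}
```

Generalizing this pattern is what lets a single-source kernel target both SIMT hardware (where the shuffle is a real instruction) and SIMD hardware (where the same cross-lane exchange maps to vector permutes).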
