Compiling Halide Programs to Push-Memory Accelerators
arXiv - CS - Hardware Architecture. Pub Date: 2021-05-26, DOI: arXiv:2105.12858
Qiaoyi Liu, Dillon Huff, Jeff Setter, Maxwell Strange, Kathleen Feng, Kavya Sreedhar, Ziheng Wang, Keyi Zhang, Mark Horowitz, Priyanka Raina, Fredrik Kjolstad

Image processing and machine learning applications benefit tremendously from hardware acceleration, but existing compilers target either FPGAs, which sacrifice power and performance for flexible hardware, or ASICs, which rapidly become obsolete as applications change. Programmable domain-specific accelerators have emerged as a promising middle-ground between these two extremes, but such architectures have traditionally been difficult compiler targets. The main obstacle is that these accelerators often use a different memory abstraction than CPUs and GPUs: push memories that send a data stream from one computation kernel to other kernels, possibly reordered. To address the compilation challenges caused by push memories, we propose that the representation of memory in the middle and backend of the compiler be altered to combine storage with address generation and control logic in a single structure -- a unified buffer. We show that this compiler abstraction can be implemented efficiently on a programmable accelerator, and design a memory mapping algorithm that combines polyhedral analysis and software vectorization techniques to target our accelerator. Our evaluation shows that the compiler supports programmability while maintaining high performance. It can compile a wide range of image processing and machine learning applications to our accelerator with 4.7x better runtime and 4.3x better energy-efficiency as compared to an FPGA.
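To make the push-memory idea concrete, here is a minimal sketch of a "unified buffer" in the sense described above: storage combined with affine address generation and control in a single structure, so a consumer kernel receives a possibly reordered stream rather than issuing loads. The class name, method names, and the affine-schedule encoding (extents/strides) are illustrative assumptions, not the paper's actual compiler IR or accelerator API.

```python
# Hypothetical sketch of a unified buffer: storage plus an affine address
# generator that emits a (possibly reordered) output stream to a consumer.
class UnifiedBuffer:
    def __init__(self, size):
        self.storage = [None] * size   # backing memory
        self.wr_addr = 0               # sequential write port

    def push(self, value):
        """Accept one element pushed from the upstream kernel's stream."""
        self.storage[self.wr_addr] = value
        self.wr_addr += 1

    def stream_out(self, extents, strides, offset=0):
        """Affine address generation: walk a nested loop over the stored
        data and emit elements in the consumer's desired order."""
        def walk(dims, base):
            if not dims:
                yield self.storage[base]
                return
            (extent, stride), rest = dims[0], dims[1:]
            for i in range(extent):
                yield from walk(rest, base + i * stride)
        yield from walk(list(zip(extents, strides)), offset)

# Example: a producer writes a 4x4 tile in row-major order; the consumer
# reads it back transposed (column-major) -- a reordered push stream.
buf = UnifiedBuffer(16)
for v in range(16):
    buf.push(v)
transposed = list(buf.stream_out(extents=[4, 4], strides=[1, 4]))
```

In the paper's setting, the memory mapping algorithm would derive the extents and strides of such access patterns from polyhedral analysis of the Halide program, rather than having them supplied by hand as in this sketch.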

Updated: 2021-05-28