A Vector-Length Agnostic Compiler for the Connex-S Accelerator with Scratchpad Memory, ACM Transactions on Embedded Computing Systems


ACM Transactions on Embedded Computing Systems (IF 2.8), Pub Date: 2020-10-03, DOI: 10.1145/3406536
Alexandru E. Şuşu


Compiling sequential C programs for Connex-S is challenging. Connex-S is a competitive, scalable, and customizable wide vector accelerator for intensive embedded applications, with 32 to 4,096 16-bit integer lanes and a limited-capacity local scratchpad memory. Our compiler toolchain uses the LLVM framework and targets OPINCAA, a JIT vector assembler and coordination C++ library that lets Connex-S accelerate computations for an arbitrary CPU. We therefore address efficient vectorization, communication, and synchronization in the compiler middle end. We perform quantitative static analysis of the program, which is useful, among other things, for the symbolic-size compiler memory allocator and the coordination mechanism of OPINCAA. We also discuss the LLVM back end for the Connex-S processor and the methodology for automatically generating instruction selection code that efficiently emulates arithmetic and logical operations on non-native types such as 32-bit integer and 16-bit floating-point. By using JIT vector assembling and by encoding the vector length of Connex-S as a parameter in the generated OPINCAA program, we achieve vector-length agnosticism, which supports execution on distinct embedded devices, such as several digital cameras with different resolutions, each equipped with a custom-width Connex-S accelerator meant to save energy for the image processing kernels. Since Connex-S has a limited-capacity local scratchpad memory, normally 256 KB, we also present how we use the PPCG C-to-C code generator to perform data tiling that minimizes the total kernel execution time, subject to fitting larger program data in the local memory. We devise an accurate cost model for the Connex-S accelerator to choose performance-optimal tile sizes at compile time. We successfully compile several simple benchmarks frequently used, for example, in high-performance and computer vision embedded applications. We report speedup factors of up to 11.33 when running them on a Connex-S accelerator with 128 16-bit integer lanes, relative to the dual-core ARM Cortex A9 host, which is clocked at a frequency 6.67 times higher and has a total of two 128-bit Neon SIMD units.
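
The abstract mentions emulating non-native 32-bit integer operations on 16-bit lanes and treating the vector length as a parameter. The sketch below illustrates both ideas in plain C and is only an illustration under stated assumptions: the function name `add32_on_16bit_lanes`, the split low/high data layout, and the scalar loop standing in for the vector lanes are all hypothetical, and none of this is the code that the paper's LLVM back end or OPINCAA actually generates.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of OPINCAA or the paper's compiler.
 * Adds two vectors of 32-bit integers that are stored as separate
 * low and high 16-bit halves, using only 16-bit operations per lane.
 * vector_length is a runtime parameter, so the same code works for
 * any lane count (for example 32, 128, or 4,096). */
void add32_on_16bit_lanes(const uint16_t *a_lo, const uint16_t *a_hi,
                          const uint16_t *b_lo, const uint16_t *b_hi,
                          uint16_t *r_lo, uint16_t *r_hi,
                          size_t vector_length)
{
    /* The loop models the lanes; on a real vector unit every lane
     * would execute the body in lockstep. */
    for (size_t lane = 0; lane < vector_length; ++lane) {
        /* 16-bit add of the low halves, wrapping modulo 2^16. */
        uint16_t lo = (uint16_t)(a_lo[lane] + b_lo[lane]);

        /* The add wrapped exactly when the result is smaller than
         * one of the operands; that is the carry into the high half. */
        uint16_t carry = (uint16_t)(lo < a_lo[lane]);

        r_lo[lane] = lo;
        /* 16-bit add of the high halves plus the carry. */
        r_hi[lane] = (uint16_t)(a_hi[lane] + b_hi[lane] + carry);
    }
}
```

Calling the function with `vector_length` set to the lane count of the target device would leave the arithmetic unchanged, which is the sense in which the sketch is vector-length agnostic. The carry is derived from a comparison rather than a wider intermediate type so that every step stays within 16 bits.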


Updated: 2020-10-03
