Extending High-Level Synthesis for Task-Parallel Programs
arXiv - CS - Hardware Architecture. Pub Date: 2020-09-23, DOI: arxiv-2009.11389
Yuze Chi, Licheng Guo, Young-kyu Choi, Jie Wang and Jason Cong

C/C++/OpenCL-based high-level synthesis (HLS) has become increasingly popular for field-programmable gate array (FPGA) accelerators in many application domains in recent years, thanks to its competitive quality of result (QoR) and short development cycle compared with the traditional register-transfer level (RTL) design approach. Yet, limited by the sequential C semantics, it remains challenging to adopt the same highly productive high-level programming approach in many other application domains, where coarse-grained tasks run in parallel and communicate with each other at a fine-grained level. While current HLS tools support task-parallel programs, productivity is greatly limited in the code development, correctness verification, and QoR tuning cycles, due to poor programmability, restricted software simulation, and slow code generation, respectively. Such limited productivity often defeats the purpose of HLS and hinders programmers from adopting HLS for task-parallel FPGA accelerators. In this paper, we extend the HLS C++ language and present a fully automated framework with programmer-friendly interfaces, universal software simulation, and fast code generation to overcome these limitations. Experimental results based on a wide range of real-world task-parallel programs show that, on average, the lines of kernel and host code are reduced by 22% and 51%, respectively, which considerably improves the programmability. The correctness verification and the iterative QoR tuning cycles are accelerated by 3.2x and 6.8x, respectively.

Updated: 2020-09-25