UNILOGIC
ACM Transactions on Reconfigurable Technology and Systems (IF 2.3), Pub Date: 2020-09-11, DOI: 10.1145/3409115
Aggelos D. Ioannou, Konstantinos Georgopoulos, Pavlos Malakonakis, Dionisios N. Pnevmatikatos, Vassilis D. Papaefstathiou, Ioannis Papaefstathiou, Iakovos Mavroidis

One of the main characteristics of High-performance Computing (HPC) applications is that they are becoming increasingly demanding in both performance and power, pushing HPC systems to their limits. Existing HPC systems have not yet reached exascale performance, mainly because of power limitations: extrapolating from today's top HPC systems, about 100–200 MW would be required to sustain exaflop-level performance. A promising way to tackle these power limitations is to deploy energy-efficient reconfigurable resources, in the form of Field-programmable Gate Arrays (FPGAs), tightly integrated with conventional CPUs. However, current FPGA tools and programming environments are optimized for accelerating a single application, or even a single task, on a single FPGA device. In this work, we present UNILOGIC (Unified Logic), a novel HPC-tailored parallel architecture that efficiently incorporates FPGAs. UNILOGIC adopts the Partitioned Global Address Space (PGAS) model and extends it to include hardware accelerators, i.e., tasks implemented on the reconfigurable resources. The main advantages of UNILOGIC are that (i) the hardware accelerators can be accessed directly by any processor in the system, and (ii) the hardware accelerators can access any memory location in the system. In this way, the proposed architecture offers a unified environment where all the reconfigurable resources can be seamlessly used by any processor/operating system. The UNILOGIC architecture also provides hardware virtualization of the reconfigurable logic, so that the hardware accelerators can be shared among multiple applications or tasks. The FPGA layer of the architecture is implemented by splitting its reconfigurable resources into (i) a static partition, which provides the PGAS-related communication infrastructure, and (ii) fixed-size, dynamically reconfigurable slots that can be programmed and accessed independently or combined to support both fine- and coarse-grain reconfiguration.
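To make the PGAS idea more concrete, the following is a minimal sketch of how a single global address could name any byte in the system, so that a CPU or a hardware accelerator on one MPSoC can reference memory owned by another. The bit layout, the constants, and the function names (`encode`, `decode`, `is_local`) are hypothetical illustrations, not the UNILOGIC address format. The 64-node and 16 GB-per-node sizes are assumptions derived from the prototype described below (64 MPSoCs; 64 GB per four-MPSoC board, assumed to be split evenly).

```python
# Hypothetical PGAS global-address layout (illustrative only, NOT the UNILOGIC format):
# the upper bits select the owning node (one Zynq MPSoC) and the lower bits are the
# byte offset within that node's local memory.

NODE_BITS = 6      # assumption: 64 MPSoCs -> 2**6 node IDs
OFFSET_BITS = 34   # assumption: 16 GB of local DDR4 per MPSoC -> 2**34 bytes
OFFSET_MASK = (1 << OFFSET_BITS) - 1


def encode(node_id: int, offset: int) -> int:
    """Pack (node_id, local offset) into one global address."""
    if not 0 <= node_id < (1 << NODE_BITS):
        raise ValueError("node_id out of range")
    if not 0 <= offset <= OFFSET_MASK:
        raise ValueError("offset out of range")
    return (node_id << OFFSET_BITS) | offset


def decode(global_addr: int) -> tuple[int, int]:
    """Split a global address back into (node_id, local offset)."""
    return global_addr >> OFFSET_BITS, global_addr & OFFSET_MASK


def is_local(global_addr: int, my_node: int) -> bool:
    """In this sketch, a requester (CPU or accelerator) on my_node serves the access
    from its own memory when the address belongs to my_node; any other address
    would have to be forwarded to the owning node."""
    return decode(global_addr)[0] == my_node
```

For example, `decode(encode(5, 0x1000))` returns `(5, 4096)`; a requester on node 5 treats that address as local, while one on node 0 would have to reach node 5. Placing the node ID in the high bits keeps each node's memory contiguous in the global space, which is one common way to lay out a partitioned address space.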
Finally, the UNILOGIC architecture has been evaluated on a custom prototype that consists of two 1U chassis, each of which includes eight interconnected daughter boards called Quad-FPGA Daughter Boards (QFDBs); each QFDB supports four tightly coupled Xilinx Zynq UltraScale+ MPSoCs as well as 64 GB of DDR4 memory, and thus the prototype features a total of 64 Zynq MPSoCs and 1 TB of memory. We tuned and evaluated the UNILOGIC prototype using both low-level (bare-metal) performance tests and two popular real-world HPC applications, one compute-intensive and one data-intensive. Our evaluation shows that UNILOGIC is 2.5 to 400 times faster and 46 to 300 times more energy efficient than conventional parallel systems that use only high-end CPUs, and that it outperforms GPUs by a factor of 3 to 6 in time to solution and 10 to 20 in energy to solution.
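The prototype totals follow directly from the per-board figures given above. The only assumption in this check is the binary convention 1 TB = 1024 GB:

```latex
\begin{aligned}
\text{MPSoCs} &= 2~\text{chassis} \times 8~\tfrac{\text{QFDBs}}{\text{chassis}} \times 4~\tfrac{\text{MPSoCs}}{\text{QFDB}} = 64,\\
\text{Memory} &= (2 \times 8)~\text{QFDBs} \times 64~\tfrac{\text{GB}}{\text{QFDB}} = 1024~\text{GB} = 1~\text{TB}.
\end{aligned}
```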

Updated: 2020-09-11