Hardware Accelerator Integration Tradeoffs for High-Performance Computing: A Case Study of GEMM Acceleration in N-Body Methods
IEEE Transactions on Parallel and Distributed Systems (IF 5.3) Pub Date: 2021-02-01, DOI: 10.1109/tpds.2021.3056045
Mochamad Asri, Dhairya Malhotra, Jiajun Wang, George Biros, Lizy K. John, Andreas Gerstlauer

In this article, we study the performance and energy-saving benefits of hardware acceleration under different hardware configurations and usage scenarios for a state-of-the-art Fast Multipole Method (FMM), a popular N-body method. We use a dedicated Application-Specific Integrated Circuit (ASIC) to accelerate General Matrix-Matrix Multiply (GEMM) operations. FMM is widely used and is a representative example of the workloads found in many HPC applications. We compare architectures that integrate the GEMM ASIC next to, in, or near main memory against an on-chip coupling aimed at minimizing or avoiding repeated round-trip transfers through DRAM for communication between the accelerator and the CPU. We study the tradeoffs using detailed and accurately calibrated x86 CPU, accelerator, and DRAM simulations. Our results show that simply moving the accelerator closer to the CPU does not necessarily lead to performance or energy gains. We demonstrate that, while careful software blocking and on-chip placement optimizations can reduce DRAM accesses by 2x over a naive on-chip integration, these dramatic savings in DRAM traffic do not automatically translate into significant total energy or runtime savings. This is chiefly due to the application's characteristics, high idle power, and the effective hiding of memory latencies in modern systems. Only when more aggressive co-optimizations such as software pipelining and overlapping are applied can additional performance and energy savings of 37 and 35 percent, respectively, be unlocked over baseline acceleration. When similar optimizations (pipelining and overlapping) are applied to an off-chip integration, the on-chip integration still delivers up to 20 percent better performance and 17 percent lower total energy consumption than the off-chip one.
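The software blocking the abstract refers to can be sketched as follows. This is a minimal illustration of tiled GEMM, not the authors' implementation; the tile size is a placeholder, since real choices depend on the accelerator's local buffer capacity.

```python
# Minimal sketch (assumption, not the paper's code) of blocked GEMM.
# Tiling keeps sub-blocks of A and B resident near the compute unit so
# they are reused many times, instead of being re-fetched from DRAM on
# every inner-loop pass -- the source of the 2x DRAM-access reduction
# the abstract describes.
def gemm_blocked(A, B, tile=2):
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):          # tile rows of A / C
        for kk in range(0, n, tile):      # tile the shared dimension
            for jj in range(0, n, tile):  # tile columns of B / C
                # Work entirely within one (tile x tile) sub-block.
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + tile, n)):
                            C[i][j] += a * B[k][j]
    return C
```

In a real accelerator setup, the innermost three loops would be the ASIC's GEMM call on tiles staged into its scratchpad; here they are spelled out to show where the data reuse comes from.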

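The pipelining/overlapping co-optimization can likewise be sketched with double buffering: while the accelerator computes on one tile, the operands of the next tile are fetched concurrently. The helper below is a hypothetical illustration of the scheme, not the paper's software; `fetch` and `compute` are stand-ins for DRAM staging and the accelerator's GEMM call.

```python
# Illustrative sketch (assumption, not from the paper): overlap data
# movement with compute via a one-slot queue, i.e. double buffering.
import threading
import queue

def run_pipelined(tiles, fetch, compute):
    """fetch(t) stages tile t's operands; compute(x) runs on staged data.
    While compute works on tile i, the producer thread prefetches i+1."""
    staged = queue.Queue(maxsize=1)  # one tile in flight = double buffer

    def producer():
        for t in tiles:
            staged.put(fetch(t))     # blocks if the buffer is full
        staged.put(None)             # sentinel: no more tiles

    threading.Thread(target=producer, daemon=True).start()

    results = []
    while (item := staged.get()) is not None:
        results.append(compute(item))  # overlaps with the next fetch
    return results
```

Whether this overlap pays off depends on the relative cost of transfers and compute, which is exactly the tradeoff the article quantifies in simulation.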
Updated: 2021-02-23