Thermal Management for FPGA Nodes in HPC Systems
ACM Transactions on Design Automation of Electronic Systems (IF 1.4), Pub Date: 2020-10-23, DOI: 10.1145/3423494
Yingyi Luo 1 , Joshua C. Zhao 1 , Arnav Aggarwal 2 , Seda Ogrenci-Memik 1 , Kazutomo Yoshii 3

The integration of FPGAs into large-scale computing systems is gaining attention. In these systems, real-time data handling for networking, scientific computing tasks, and machine learning can be executed with customized datapaths on reconfigurable fabric within heterogeneous compute nodes. At the same time, thermal management, particularly containing cooling costs and guaranteeing reliability, is a continuing concern. The introduction of new heterogeneous components into HPC nodes adds further complexity to thermal modeling and management, and the thermal behavior of multi-FPGA systems deployed within large compute clusters remains less explored. In this article, we first show that the thermal behavior of different FPGAs of the same generation can vary with their physical location in a rack and with process variation, even when they are running the same tasks. We present a machine learning–based model to capture the thermal behavior of each individual FPGA in the cluster. We then propose two thermal management strategies guided by our thermal model. First, we mitigate thermal variation and hotspots across the cluster through proactive thermal-aware task placement. Under the tested system and benchmarks, we achieve a system temperature reduction of up to 26.4 °C, and 13.3 °C on average, with no performance penalty. Second, we use this thermal model to guide HLS parameter tuning at the task design stage to achieve an improved thermal response after deployment.
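The abstract does not spell out the placement algorithm, so the following is only a minimal sketch of what thermal-aware task placement could look like, not the authors' method. It assumes each FPGA has its own predictor that maps a task's power draw to a steady-state temperature. The linear predictor, the greedy heuristic, and all names and numbers (`make_linear_model`, `place_tasks`, `fpga0`, the wattages) are hypothetical stand-ins for the learned per-device model described above.

```python
from typing import Callable, Dict

# Hypothetical per-FPGA predictor: task power (W) -> predicted temperature (deg C).
# In the paper this role is played by a learned model; a linear one is assumed here.
ThermalModel = Callable[[float], float]


def make_linear_model(inlet_temp_c: float, c_per_watt: float) -> ThermalModel:
    """Build a toy model: inlet temperature plus a per-device thermal slope."""
    return lambda power_w: inlet_temp_c + c_per_watt * power_w


def place_tasks(task_power_w: Dict[str, float],
                models: Dict[str, ThermalModel]) -> Dict[str, str]:
    """Greedy heuristic: place the most power-hungry task first, each on the
    free FPGA whose model predicts the lowest temperature for that task."""
    if len(task_power_w) > len(models):
        raise ValueError("more tasks than FPGAs")
    free = set(models)
    placement: Dict[str, str] = {}
    for task in sorted(task_power_w, key=task_power_w.get, reverse=True):
        power = task_power_w[task]
        # sorted(free) keeps tie-breaking deterministic.
        coolest = min(sorted(free), key=lambda fpga: models[fpga](power))
        placement[task] = coolest
        free.remove(coolest)
    return placement


# Illustrative (made-up) devices: same part, different rack positions.
models = {
    "fpga0": make_linear_model(25.0, 0.5),
    "fpga1": make_linear_model(35.0, 0.8),
    "fpga2": make_linear_model(30.0, 0.6),
}
tasks = {"fft": 40.0, "gemm": 60.0, "stream": 10.0}

placement = place_tasks(tasks, models)
peak_c = max(models[fpga](tasks[task]) for task, fpga in placement.items())
print(placement, peak_c)
```

In this toy setup the heaviest task lands on the device with the coolest inlet and smallest slope, and the lightest task absorbs the least favorable position. A greedy pass like this is not guaranteed to minimize the peak temperature in general; it only illustrates how a per-device model can steer placement without changing the tasks themselves.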

Updated: 2020-10-23