The RECIPE Approach to Challenges in Deeply Heterogeneous High Performance Systems
arXiv - CS - Distributed, Parallel, and Cluster Computing Pub Date : 2021-03-04 , DOI: arxiv-2103.03044
Giovanni Agosta, William Fornaciari, David Atienza, Ramon Canal, Alessandro Cilardo, José Flich Cardo, Carles Hernandez Luz, Michal Kulczewski, Giuseppe Massari, Rafael Tornero Gavilá, Marina Zapater

RECIPE (REliable power and time-ConstraInts-aware Predictive management of heterogeneous Exascale systems) is a recently started project funded within the H2020 FETHPC programme, which is expressly targeted at exploring new High-Performance Computing (HPC) technologies. RECIPE aims to introduce a hierarchical runtime resource management infrastructure that optimizes energy efficiency and minimizes the occurrence of thermal hotspots, while enforcing the time constraints imposed by the applications and ensuring reliability for both time-critical and throughput-oriented computations that run on deeply heterogeneous accelerator-based systems. This paper presents a detailed overview of RECIPE, identifying the fundamental challenges as well as the key innovations addressed by the project. In particular, the need for predictive reliability approaches that maximize hardware lifetime and guarantee application performance is identified as the key concern for RECIPE. It is addressed via hierarchical resource management of the heterogeneous architectural components of the system, driven by estimates of application latency and hardware reliability obtained, respectively, through timing analysis and through modelling of the thermal properties and mean-time-to-failure of subsystems. We show the impact of prediction accuracy on the overheads imposed by the checkpointing policy, as well as a possible application to a weather forecasting use case.
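
The link between prediction accuracy and checkpointing overhead can be illustrated with a minimal sketch. It does not reproduce the RECIPE model, which the abstract does not specify; it assumes Young's classical first-order approximation, in which the fraction of time lost is roughly C/T + T/(2M) for checkpoint cost C, checkpoint interval T, and mean-time-to-failure M, minimized at T = sqrt(2CM). All function and parameter names are hypothetical.

```python
import math


def young_interval(checkpoint_cost_s: float, mttf_s: float) -> float:
    """Checkpoint interval from Young's first-order approximation (assumed model)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mttf_s)


def waste_fraction(interval_s: float, checkpoint_cost_s: float, true_mttf_s: float) -> float:
    """First-order fraction of time lost to checkpoint writes plus expected rework."""
    return checkpoint_cost_s / interval_s + interval_s / (2.0 * true_mttf_s)


def overhead_penalty(checkpoint_cost_s: float, true_mttf_s: float, predicted_mttf_s: float) -> float:
    """Ratio of the waste under a mispredicted MTTF to the waste under a perfect prediction."""
    best = waste_fraction(
        young_interval(checkpoint_cost_s, true_mttf_s), checkpoint_cost_s, true_mttf_s
    )
    actual = waste_fraction(
        young_interval(checkpoint_cost_s, predicted_mttf_s), checkpoint_cost_s, true_mttf_s
    )
    return actual / best


if __name__ == "__main__":
    cost_s = 60.0          # hypothetical checkpoint cost: one minute
    true_mttf_s = 86400.0  # hypothetical true MTTF: one day
    for factor in (0.25, 0.5, 1.0, 2.0, 4.0):
        penalty = overhead_penalty(cost_s, true_mttf_s, factor * true_mttf_s)
        print(f"predicted/true MTTF = {factor:4.2f} -> relative overhead {penalty:.3f}")
```

Under this simplified model the penalty depends only on the ratio k between predicted and true MTTF, and equals (sqrt(k) + 1/sqrt(k)) / 2, so overestimating and underestimating by the same factor cost the same. A real policy would also need to account for restart cost and non-exponential failure distributions.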

Updated: 2021-03-05