CPU overheating prediction in HPC systems
Concurrency and Computation: Practice and Experience ( IF 1.5 ) Pub Date : 2021-03-29 , DOI: 10.1002/cpe.6231
Marc Platini 1, 2 , Thomas Ropars 1 , Benoit Pelletier 1 , Noel Palma 2

As supercomputers grow in size, the number of abnormal events they experience also increases. CPU overheating is one such event, and it decreases system efficiency: when a CPU overheats, it reduces its frequency. This paper presents a machine learning solution to predict such events. The proposed algorithm uses dynamic time warping for feature extraction and a machine learning algorithm for classification. It predicts overheating events solely by analyzing CPU temperature trends, and it can cope with very low temperature sampling rates while incurring negligible computational cost in practice. Our evaluation, using data from a production supercomputer, shows that the proposed solution can make predictions a few minutes in advance with good accuracy. Furthermore, considering two simple preventive actions to avoid CPU overheating events, we present an analytical study showing that our predictive solution is good enough to significantly reduce the cost of overheating events.
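
To make the feature-extraction idea concrete, the following is a minimal sketch of the textbook dynamic time warping (DTW) distance, used to turn a window of temperature samples into a feature vector. It is not the authors' implementation: the abstract does not specify the reference patterns, the window length, the local cost function, or the classifier, so the names `dtw_distance`, `dtw_features`, and `reference_patterns`, and the absolute-difference cost, are all assumptions made for illustration.

```python
from math import inf


def dtw_distance(a, b):
    """Textbook DTW distance between two numeric sequences.

    Uses an absolute-difference local cost (an assumption; the abstract
    does not say which cost the paper uses). Runs in O(len(a) * len(b)).
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def dtw_features(window, reference_patterns):
    """Hypothetical feature extractor: one DTW distance per reference pattern."""
    return [dtw_distance(window, ref) for ref in reference_patterns]
```

Under these assumptions, the resulting distance vector could be passed to any standard classifier. Because DTW tolerates stretching along the time axis, two temperature curves with the same shape but different pacing still compare as similar.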

Updated: 2021-03-29