AquaSee: Predict Load and Cooling System Faults of Supercomputers Using Chilled Water Data
Journal of Computer Science and Technology (IF 1.9), Pub Date: 2020-01-01, DOI: 10.1007/s11390-019-1951-7
Yu-Qi Li, Li-Quan Xiao, Jing-Hua Feng, Bin Xu, Jian Zhang

An analysis of real-world operational data from the Tianhe-1A (TH-1A) supercomputer system shows that chilled water data not only reflect the status of the chiller system but are also related to supercomputer load. This study proposes AquaSee, a method that predicts the load and cooling system faults of supercomputers from chilled water pressure and temperature data. The method is validated on real-world operational data from the TH-1A supercomputer system at the National Supercomputer Center in Tianjin. Prediction models are built from datasets of various compositions and with different prediction sequence lengths. Experimental results show that a model using pressure and temperature data together performs better than one using pressure or temperature data alone. The best inference sequence length is two points. Furthermore, an anomaly monitoring system based on chilled water data is set up to help engineers detect chiller system anomalies.
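
As a rough illustration of what "dataset composition" and "sequence length" could mean in practice, the sketch below builds fixed-length input windows from chilled water readings. It is only an assumption-laden example: the abstract does not describe the model or the preprocessing, so the function `build_windows`, the channel names, the alignment of each window with the load at its last time step, and the sample values are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Sequence, Tuple


def build_windows(
    channels: Dict[str, Sequence[float]],
    load: Sequence[float],
    use: Sequence[str],
    seq_len: int = 2,
) -> Tuple[List[List[float]], List[float]]:
    """Pair the last `seq_len` readings of each selected channel
    with the load observed at the window's final time step."""
    if seq_len < 1:
        raise ValueError("seq_len must be at least 1")
    features: List[List[float]] = []
    targets: List[float] = []
    for t in range(seq_len - 1, len(load)):
        row: List[float] = []
        for name in use:
            # Append the readings from t - seq_len + 1 through t, inclusive.
            row.extend(channels[name][t - seq_len + 1 : t + 1])
        features.append(row)
        targets.append(load[t])
    return features, targets


# Synthetic readings for illustration only; not data from the paper.
channels = {
    "pressure": [0.40, 0.42, 0.41, 0.45],
    "temperature": [12.0, 12.3, 12.1, 12.8],
}
load = [0.50, 0.55, 0.52, 0.63]

# Combined composition (pressure and temperature), sequence length of two points.
X, y = build_windows(channels, load, use=["pressure", "temperature"], seq_len=2)
```

Changing `use` to `["pressure"]` or `["temperature"]` yields the single-channel compositions, and changing `seq_len` varies the window length, so the same helper can produce each dataset variant for whatever prediction model is fitted afterwards.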

Updated: 2020-01-01