AquaSee: Predict Load and Cooling System Faults of Supercomputers Using Chilled Water Data

  • Regular Paper
Journal of Computer Science and Technology

Abstract

An analysis of real-world operational data from the Tianhe-1A (TH-1A) supercomputer system shows that chilled water data not only reflect the status of the chiller system but are also related to the supercomputer load. This study proposes AquaSee, a method that predicts the load and cooling system faults of supercomputers from chilled water pressure and temperature data. The method is validated on real-world operational data of the TH-1A supercomputer system at the National Supercomputer Center in Tianjin. Prediction models are constructed from datasets with various compositions and with different prediction sequence lengths. Experimental results show that a model using both pressure and temperature data performs better than one using only pressure or only temperature data, and that the best inference sequence length is two points. Furthermore, an anomaly monitoring system based on chilled water data is set up to help engineers detect chiller system anomalies.
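
To make the data flow described in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each training sample pairs a short window of consecutive (pressure, temperature) readings with the load at the following time step, and that anomalies are flagged when the prediction residual deviates strongly from its mean. The function names `make_windows` and `flag_anomalies`, the default threshold `k`, and the choice of target step are all hypothetical; the abstract does not specify the prediction model, so it is left out here.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple

Window = List[Tuple[float, float]]  # consecutive (pressure, temperature) readings


def make_windows(
    pressure: Sequence[float],
    temperature: Sequence[float],
    load: Sequence[float],
    seq_len: int = 2,  # the abstract reports two points as the best length
) -> Tuple[List[Window], List[float]]:
    """Build (window, target) pairs from aligned chilled water and load series.

    Assumption: the target is the load at the step right after the window.
    """
    if not (len(pressure) == len(temperature) == len(load)):
        raise ValueError("all series must have the same length")
    if seq_len < 1:
        raise ValueError("seq_len must be at least 1")
    inputs: List[Window] = []
    targets: List[float] = []
    for end in range(seq_len, len(load)):
        start = end - seq_len
        inputs.append(list(zip(pressure[start:end], temperature[start:end])))
        targets.append(load[end])
    return inputs, targets


def flag_anomalies(
    actual: Sequence[float],
    predicted: Sequence[float],
    k: float = 3.0,  # hypothetical threshold, in standard deviations
) -> List[bool]:
    """Flag points whose residual lies more than k standard deviations from the mean residual."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    mu = mean(residuals)
    sigma = pstdev(residuals)
    if sigma == 0:
        return [False] * len(residuals)
    return [abs(r - mu) > k * sigma for r in residuals]
```

Any regression or sequence model could be trained on the output of `make_windows`; comparing its predictions with later measurements through `flag_anomalies` is one simple way to surface candidate anomalies for engineers to inspect.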

Author information

Corresponding author

Correspondence to Yu-Qi Li.

Electronic supplementary material

ESM 1

(PDF 345 kb)

About this article

Cite this article

Li, YQ., Xiao, LQ., Feng, JH. et al. AquaSee: Predict Load and Cooling System Faults of Supercomputers Using Chilled Water Data. J. Comput. Sci. Technol. 35, 221–230 (2020). https://doi.org/10.1007/s11390-019-1951-7

  • DOI: https://doi.org/10.1007/s11390-019-1951-7
