PARIS: Predicting application resilience using machine learning
Journal of Parallel and Distributed Computing ( IF 3.8 ) Pub Date : 2021-03-06 , DOI: 10.1016/j.jpdc.2021.02.015
Luanzheng Guo , Dong Li , Ignacio Laguna

The traditional way to study how resilient HPC applications are to errors is fault injection (FI), a time-consuming approach. Analytical models have been built to overcome the inefficiency of FI, but they lack accuracy. In this paper, we present PARIS, a machine-learning method for predicting application resilience that avoids the time-consuming process of random FI and provides higher prediction accuracy than analytical models. PARIS captures the implicit relationship between application characteristics and application resilience, which most analytical models struggle to capture. To use machine learning in our prediction approach, we overcome many technical challenges in feature construction, extraction, and selection. Our evaluation on 16 HPC benchmarks shows that PARIS achieves high prediction accuracy. PARIS is up to 450x faster than random FI (49x on average). Compared to the state-of-the-art analytical model, PARIS is at least 63% better in terms of accuracy and has comparable execution time on average.
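The general idea of learning resilience from application characteristics can be illustrated with a toy sketch. The abstract does not specify the features or the model PARIS uses, so everything below is an assumption made for illustration: the single feature, the numbers, and the one-variable least-squares fit are all hypothetical and are not taken from the paper.

```python
# Toy sketch only. The feature, the data, and the linear model are
# hypothetical; they are NOT the features or model used by PARIS.

def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b with a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

def predict(model, x):
    """Predict a success rate, clamped to [0, 1] because it is a rate."""
    a, b = model
    return min(1.0, max(0.0, a * x + b))

# Made-up training set: one invented characteristic per benchmark,
# paired with a made-up success rate of the kind FI would measure.
train_feature = [0.1, 0.2, 0.3, 0.4]
train_rate = [0.55, 0.60, 0.65, 0.70]

model = fit_line(train_feature, train_rate)

# Once trained, a new application is scored from its feature value
# alone, with no further fault-injection runs.
new_rate = predict(model, 0.5)
```

The point of the sketch is the cost structure: the expensive measurements are needed only for the training set, while prediction for a new application is a cheap function evaluation.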




Updated: 2021-03-21