Automated Performance Modeling of HPC Applications Using Machine Learning, IEEE Transactions on Computers


IEEE Transactions on Computers (IF 3.6), Pub Date: 2020-01-10, DOI: 10.1109/tc.2020.2964767
Jingwei Sun, Guangzhong Sun, Shiyan Zhan, Jiepeng Zhang, Yong Chen

Automated performance modeling and performance prediction of parallel programs are highly valuable in many use cases, such as guiding task management and job scheduling, offering insight into application behavior, and assisting resource requirement estimation. The performance of parallel programs is affected by numerous factors, including but not limited to hardware, applications, algorithms, and input parameters, so accurate performance prediction is often a challenging task. In this article, we focus on automatically predicting the execution time of parallel programs (more specifically, MPI programs) with different inputs, at different scales, and without domain knowledge. We model the correlation between the execution time and domain-independent runtime features. These features include values of variables and counters of branches, loops, and MPI communications. The MPI program is instrumented automatically so that each execution outputs a feature vector and its corresponding execution time. After collecting data from executions with different inputs, a random forest machine learning approach is used to build an empirical performance model, which can predict the execution time of the program given a new input. A transfer learning method is used to reuse an existing performance model and improve the prediction accuracy on a new platform that lacks historical execution data. Our experiments and analyses of three parallel applications, Graph500, GalaxSee, and SMG2000, on three different systems confirm that our method performs well, with less than 20 percent prediction error on average.
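The modeling step described in the abstract (one feature vector and one execution time per run, then a random forest fitted to those pairs) can be illustrated with a minimal sketch. This is not the authors' implementation: it is a simplified, standard-library-only random forest regressor, and the feature layout, the `features` and `synthetic_time` helpers, and all coefficients are hypothetical stand-ins for data that real instrumentation would produce. The transfer learning step is not covered.

```python
import random


def sse(values):
    """Sum of squared errors around the mean (the split criterion)."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def build_tree(X, y, rng, depth=0, max_depth=6, min_leaf=2):
    """Grow one regression tree; a leaf is the mean time of its samples."""
    if depth >= max_depth or len(y) < 2 * min_leaf or len(set(y)) == 1:
        return sum(y) / len(y)
    n_features = len(X[0])
    best = None
    # Consider only a random subset of the features at each split.
    for f in rng.sample(range(n_features), max(1, n_features // 3)):
        values = sorted({row[f] for row in X})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # midpoint between neighbouring values
            left = [i for i, row in enumerate(X) if row[f] <= t]
            right = [i for i, row in enumerate(X) if row[f] > t]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            score = sse([y[i] for i in left]) + sse([y[i] for i in right])
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    if best is None:
        return sum(y) / len(y)
    _, f, t, left, right = best

    def grow(idx):
        return build_tree([X[i] for i in idx], [y[i] for i in idx],
                          rng, depth + 1, max_depth, min_leaf)

    return (f, t, grow(left), grow(right))


def predict_tree(node, row):
    """Walk from the root to a leaf and return the leaf's mean time."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node


def fit_forest(X, y, n_trees=30, seed=0):
    """Fit each tree on a bootstrap resample of the collected runs."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        forest.append(build_tree([X[i] for i in idx], [y[i] for i in idx], rng))
    return forest


def predict(forest, row):
    """Average the per-tree predictions."""
    return sum(predict_tree(tree, row) for tree in forest) / len(forest)


def features(n):
    """Hypothetical feature vector: input value, loop counter, MPI call counter."""
    return [n, n * n, 4 * n]


def synthetic_time(n):
    """Made-up execution time in seconds; stands in for a measured run."""
    loops, msgs = n * n, 4 * n
    return 0.002 * loops + 0.05 * msgs


train_inputs = range(1, 41)
X = [features(n) for n in train_inputs]
y = [synthetic_time(n) for n in train_inputs]
forest = fit_forest(X, y)

new_input = 20.5  # an input that was not in the training runs
predicted = predict(forest, features(new_input))
actual = synthetic_time(new_input)
print(f"predicted {predicted:.2f} s, actual {actual:.2f} s")
```

Because each leaf stores an average of observed times, a forest like this interpolates between collected runs but cannot extrapolate beyond the largest time it has seen, which is one reason the range of training inputs matters for such a model.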


Updated: 2020-01-10
