ELS: Emulation system for debugging and tuning large-scale parallel programs on small clusters, The Journal of Supercomputing



The Journal of Supercomputing (IF 2.5), Pub Date: 2020-05-23, DOI: 10.1007/s11227-020-03319-6
Fang Lin, Yi Liu, Yayu Guo, Depei Qian

The continued scaling-up of high-performance computing (HPC) systems makes large-scale parallel programs harder to debug and tune. First, to locate a bug or tune performance, programmers often need to execute the program repeatedly at a specified scale, which consumes massive resources. Second, because job scheduling systems are used so extensively, programmers can only submit their programs as jobs and cannot interact with them, which restricts debugging efficiency and flexibility. To address these challenges, this paper proposes an emulation system that supports debugging and tuning of large-scale parallel programs by executing them at the desired scale on a small cluster. The program is first executed at the desired scale on the target HPC system to record the necessary information. Programmers can then choose a subset of the program's processes and re-execute it repeatedly on a small cluster; during re-execution the emulation system controls the execution of the processes, and programmers can debug their programs by attaching tools to the selected processes. The system also supports the popular CPU+GPU heterogeneous architecture. The system is evaluated on a small cluster, with a 1000-node system as the target HPC system; experimental results demonstrate the accuracy and efficiency of emulation-execution.
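The record-then-replay workflow described in the abstract can be illustrated with a toy sketch. This is not the ELS implementation: the function names (`step`, `record_full_run`, `replay_subset`), the ring communication pattern, and the in-memory trace are all assumptions made purely for illustration. The sketch first runs every simulated rank and logs each received message, then re-executes only a chosen subset of ranks by feeding them the logged messages in place of live communication.

```python
from collections import defaultdict


def step(value, inbox):
    """One round of a rank's local computation (hypothetical workload)."""
    return value + sum(inbox)


def record_full_run(size, rounds):
    """Record phase: run all ranks and log every message by (round, receiver)."""
    trace = defaultdict(list)                 # (round, dst) -> list of payloads
    state = {rank: rank for rank in range(size)}
    for rnd in range(rounds):
        # Each rank sends its current value to its right neighbour in a ring.
        for src in range(size):
            dst = (src + 1) % size
            trace[(rnd, dst)].append(state[src])
        # Each rank then updates its value from what it received this round.
        state = {r: step(state[r], trace[(rnd, r)]) for r in range(size)}
    return state, dict(trace)


def replay_subset(selected, rounds, trace):
    """Replay phase: re-execute only the selected ranks from the trace.

    Receives are served from the recorded trace; sends from the replayed
    ranks are simply dropped, since their peers are not running.
    """
    state = {rank: rank for rank in selected}
    for rnd in range(rounds):
        state = {r: step(state[r], trace[(rnd, r)]) for r in selected}
    return state


if __name__ == "__main__":
    full_state, trace = record_full_run(size=1000, rounds=5)
    replayed = replay_subset([0, 499, 999], rounds=5, trace=trace)
    # The replayed ranks should reach the same values as in the full run.
    print(all(replayed[r] == full_state[r] for r in replayed))
```

In this toy model the replayed ranks reproduce their full-scale results while only three of the 1000 simulated ranks are executed, which is the property that lets a debugger be attached to a few processes on a small machine. A real system would also have to deal with timing, nondeterminism, and accelerator state, none of which this sketch attempts.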


Updated: 2020-05-23
