ScalAna: Automating Scaling Loss Detection with Graph Analysis
arXiv - CS - Performance. Pub Date: 2020-09-03, DOI: arxiv-2009.01692
Yuyang Jin, Haojie Wang, Teng Yu, Xiongchao Tang, Torsten Hoefler, Xu Liu, and Jidong Zhai

Scaling a parallel program to modern supercomputers is challenging due to inter-process communication, Amdahl's law, and resource contention. Performance analysis tools for finding such scaling bottlenecks are based on either profiling or tracing. Profiling incurs low overhead but does not capture the detailed dependencies needed for root-cause analysis. Tracing collects all information, but at prohibitive overhead. In this work, we design ScalAna, which uses static analysis techniques to achieve the best of both worlds: it offers the analyzability of traces at a cost similar to profiling. ScalAna first leverages static compiler techniques to build a Program Structure Graph, which records the main computation and communication patterns as well as the program's control structures. At runtime, we adopt lightweight techniques to collect performance data according to the graph structure and generate a Program Performance Graph. With this graph, we propose a novel approach, called backtracking root cause detection, which can automatically and efficiently detect the root cause of scaling loss. We evaluate ScalAna with real applications. Results show that our approach can effectively locate the root cause of scaling loss for real applications and incurs 1.73% overhead on average for up to 2,048 processes. We achieve up to 11.11% performance improvement by fixing the root causes detected by ScalAna on 2,048 processes.
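
The abstract does not spell out how backtracking over the Program Performance Graph works, so the following is only a minimal illustrative sketch of the general idea, not ScalAna's actual algorithm or data structures. It assumes a toy graph whose vertices carry a measured time at a small and a large process count plus a list of vertices they wait on; the names `Vertex`, `scaling_loss`, `backtrack_root_cause`, and the `threshold` parameter are all hypothetical. Starting from the vertex with the largest scaling loss, it walks dependency edges backwards for as long as a predecessor also scales poorly.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    """One node of a toy performance graph (hypothetical layout)."""
    name: str
    kind: str          # "comp" (computation) or "comm" (communication)
    time_small: float  # seconds measured at the small process count
    time_large: float  # seconds measured at the large process count
    waits_on: list = field(default_factory=list)  # names of vertices it depends on


def scaling_loss(v, ideal_speedup):
    """Time at the large scale in excess of ideal strong scaling."""
    return v.time_large - v.time_small / ideal_speedup


def backtrack_root_cause(graph, ideal_speedup, threshold=0.05):
    """Follow dependency edges backwards from the worst-scaling vertex.

    Returns the list of vertex names visited; the last entry is the
    candidate root cause under this sketch's assumptions.
    """
    current = max(graph.values(), key=lambda v: scaling_loss(v, ideal_speedup))
    path = [current.name]
    visited = {current.name}
    while True:
        # Keep only unvisited predecessors that themselves scale poorly.
        candidates = [
            graph[n] for n in current.waits_on
            if n not in visited
            and scaling_loss(graph[n], ideal_speedup) > threshold
        ]
        if not candidates:
            return path
        current = max(candidates, key=lambda v: scaling_loss(v, ideal_speedup))
        visited.add(current.name)
        path.append(current.name)


# Made-up numbers: an 8x increase in process count (ideal speedup of 8).
toy_graph = {
    "load_mesh": Vertex("load_mesh", "comp", 0.8, 0.1),
    "solve_loop": Vertex("solve_loop", "comp", 8.0, 2.5, ["load_mesh"]),
    "allreduce": Vertex("allreduce", "comm", 1.0, 3.0, ["solve_loop"]),
}

path = backtrack_root_cause(toy_graph, ideal_speedup=8.0)
print(path)
```

In this toy input the communication vertex shows the largest loss, but the walk ends at the computation vertex it waits on, which is the kind of distinction between symptom and cause that the abstract attributes to dependency-aware analysis.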

Updated: 2020-09-04