Computer architecture and high performance computing
Concurrency and Computation: Practice and Experience — Pub Date: 2021-07-28, DOI: 10.1002/cpe.6526
Raphael Y. Camargo, Fabrizio Marozzo, Wellington Martins

In this special issue of Concurrency and Computation: Practice and Experience, we are pleased to present eight selected papers that were previously presented at the Brazilian “XX Simpósio em Sistemas Computacionais de Alto Desempenho” (WSCAD 2019). The event was held in conjunction with the 31st International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD 2019) in Campo Grande, MS, Brazil, from October 15 to 18, 2019. The WSCAD workshop has presented important research in the fields of computer architecture, high-performance computing, and distributed systems since the early 2000s.

The scope of the current special issue is broad and representative, with different forms of contributions to our discipline: methodological papers, technology papers, application papers, and system papers. The topics covered in the papers include architecture issues, compiler optimization, performance evaluation, parallel algorithms, energy efficiency, and applications.

The title of the first paper is “Structural testing for communication events into loops of message-passing parallel programs,” by Diaz et al.1 In this paper, the authors propose new structural testing criteria for message-passing parallel programs, focusing on defects arising from communication primitives within loops. A new test model is presented to support their criteria for structural testing of MPI applications. The testing criteria are validated through experimental studies using a tool called ValiMPI. The results show that previously unknown defects involving communication and synchronization events can be revealed across different loop iterations, improving the quality of message-passing parallel programs.
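To make the idea of loop-aware testing criteria concrete, the sketch below is a hypothetical illustration (not the paper's model or ValiMPI's implementation): a criterion of this kind can be seen as enumerating (communication event, loop iteration) pairs as test requirements, so that a test suite is measured by how many such pairs its executions exercise.

```python
# Hypothetical sketch of a loop-aware structural-testing criterion:
# each communication event must be covered in each loop iteration.
def communication_requirements(events, iterations):
    """Enumerate (event, iteration) pairs a test suite must cover."""
    return [(e, i) for e in events for i in range(1, iterations + 1)]

def coverage(requirements, observed):
    """Fraction of required pairs exercised by the observed executions."""
    covered = [r for r in requirements if r in observed]
    return len(covered) / len(requirements)
```

A defect that only manifests when, say, a receive is mismatched in the third iteration would be missed by a suite that covers each event only once, which is exactly the gap iteration-sensitive criteria close.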

In the second contribution, entitled “Smart selection of optimizations in dynamic compilers,” Rosario et al.2 present an approach that uses machine learning to select optimization sequences for dynamic compilation, considering both code quality and compilation overhead. Their approach first trains a model offline, using a genetic heuristic to build a knowledge bank of sequences with low overhead and high-quality code generation. This bank then guides the smart selection of optimization sequences when compiling code fragments during the emulation of an application. The proposed strategy is evaluated in two LLVM-based dynamic binary translators, OI-DBT and HQEMU, showing average speedups of 1.26× and 1.15× on the MiBench and SPEC CPU benchmarks, respectively.
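The genetic search over optimization sequences can be sketched as follows. This is a toy analogue under invented assumptions, not the authors' system: the pass names and the synthetic fitness function (quality rewarded, sequence length penalized as a stand-in for compilation overhead) are illustrative only.

```python
import random

# Toy sketch: genetic search over optimization-pass sequences, scoring each
# sequence by a synthetic fitness that trades code quality against overhead.
PASSES = ["inline", "licm", "gvn", "dce", "sroa"]   # illustrative pass names

def fitness(seq):
    # Hypothetical stand-in: quality grows with distinct useful passes,
    # overhead with sequence length.
    quality = len(set(seq))
    overhead = 0.2 * len(seq)
    return quality - overhead

def evolve(pop_size=20, seq_len=4, generations=30, seed=0):
    rng = random.Random(seed)
    pop = [[rng.choice(PASSES) for _ in range(seq_len)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]            # elitist selection
        children = []
        for _ in range(pop_size - len(survivors)):
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, seq_len)         # one-point crossover
            child = a[:cut] + b[cut:]
            if rng.random() < 0.1:                  # occasional mutation
                child[rng.randrange(seq_len)] = rng.choice(PASSES)
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)
```

In the paper's setting, the offline phase amortizes this search cost so that, at emulation time, only the cheap model lookup remains on the critical path.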

In the third contribution, entitled “Memory allocation anomalies in high-performance computing applications: A study with numerical simulations,” Gomes et al.3 propose a method for identifying, locating, characterizing, and fixing allocation anomalies, along with a tool that helps developers apply the method. A numerical simulator that approximates the solutions of partial differential equations using a finite element method is used in the experiments. It is shown that taming allocation anomalies in the simulator reduces both its execution time and the memory footprint of its processes, irrespective of the specific heap allocator employed. They conclude that developers of HPC applications can benefit from the method and tool throughout the software development cycle.
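One common class of allocation anomaly is churn: the same size allocated and freed repeatedly at the same site inside a loop, where a single hoisted buffer would do. The sketch below is purely illustrative (the event format, site strings, and threshold are invented, and it is not the authors' tool): it scans an instrumented allocation trace and flags high-frequency alloc/free pairs.

```python
from collections import Counter

# Illustrative churn detector over a hypothetical allocation trace.
# Each event is (op, size, site), with op in {"alloc", "free"}; the sketch
# assumes frees are attributed to the matching allocation site.
def find_churn(trace, threshold=100):
    """Return {(site, size): alloc_count} for sites allocating and freeing
    the same size at least `threshold` times."""
    allocs = Counter((site, size) for op, size, site in trace if op == "alloc")
    frees = Counter((site, size) for op, size, site in trace if op == "free")
    return {key: n for key, n in allocs.items()
            if n >= threshold and frees.get(key, 0) >= threshold}
```

Fixing such a site typically means allocating once before the loop and reusing the buffer, which is one way taming anomalies can shrink both runtime and footprint.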

The fourth contribution, entitled “Investigating memory prefetcher performance over parallel applications: From real to simulated,” by Girelli et al.,4 sheds light on the memory prefetcher's role in the performance of parallel high-performance computing applications, considering the prefetcher algorithms offered by both real hardware and simulators. The authors performed a careful experimental investigation, executing the NAS Parallel Benchmarks (NPB) on a real Skylake machine as well as in simulated environments with the ZSim and Sniper simulators, taking into account the prefetcher algorithms offered by both Skylake and the simulators. The experimental results show that: (i) prefetching from the L3 to the L2 cache delivers the larger performance gains, (ii) memory contention in parallel execution constrains the prefetcher's effect, (iii) Skylake's parallel memory contention is poorly simulated by ZSim and Sniper, and (iv) Skylake's noninclusive L3 cache hinders the accurate simulation of NPB with Sniper's prefetchers.
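For readers unfamiliar with prefetching, a minimal sketch helps fix the mechanism being evaluated. The policy below is a simple next-line prefetcher, far simpler than the hardware and simulator prefetchers compared in the paper, and the cache model (unbounded set of lines) is an invented simplification: on each miss, the following cache line is fetched speculatively, so streaming access patterns mostly hit.

```python
# Minimal next-line prefetcher sketch over an idealized (unbounded) cache.
def simulate(addresses, line_size=64):
    cache, hits = set(), 0
    for addr in addresses:
        line = addr // line_size
        if line in cache:
            hits += 1
        else:
            cache.add(line)
            cache.add(line + 1)   # prefetch the next line on a miss
    return hits, len(addresses)
```

Even this toy model shows why contention matters: the prefetch only pays off if the speculative fetch completes before the demand access arrives, which is exactly what saturated memory bandwidth in parallel runs undermines.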

In the fifth contribution, entitled “Energy efficiency and portability of oil and gas simulations on multicore and graphics processing unit architectures,” Serpa et al.5 propose three optimizations for an oil and gas application, reverse time migration (RTM), which reduce floating-point operations by rewriting the equation derivatives. They evaluate these optimizations on different multicore and GPU architectures, investigating the impact of different APIs on the performance, energy efficiency, and portability of the code. The experimental results show that the dedicated CUDA implementation running on the NVIDIA Volta architecture has the best performance and energy efficiency for RTM on GPUs, while the OpenMP version is the best on the Intel Broadwell multicore. Moreover, the OpenACC version, which requires lower programming effort and executes on both architectures, has up to 20% better performance and energy efficiency than the nonportable versions.
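RTM kernels spend most of their time evaluating spatial derivatives at every grid point, so rewriting those expressions to remove redundant floating-point work pays off directly. The snippet below is a generic illustration, not the authors' code or their specific optimizations: a second-order finite-difference Laplacian where the 1/dx² factor is hoisted out of the loop, replacing a division per point with one division total.

```python
# Generic second-order central-difference Laplacian in 1D (illustrative only).
def laplacian_1d(u, dx):
    inv_dx2 = 1.0 / (dx * dx)   # hoisted: one division instead of one per point
    return [(u[i - 1] - 2.0 * u[i] + u[i + 1]) * inv_dx2
            for i in range(1, len(u) - 1)]
```

Rewrites of this flavor (factoring constants, fusing derivative terms) change the operation count without changing the stencil's result, which is the kind of transformation the paper applies across derivative expressions.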

In the sixth paper, entitled “An open computing language-based parallel Brute Force algorithm for formal concept analysis on heterogeneous architectures,” Novais et al.6 propose and evaluate an Open Computing Language (OpenCL)-based Brute Force algorithm for formal concept extraction on heterogeneous architectures (CPU + GPU and CPU + FPGA). The CPU + GPU architecture presents higher performance and scalability than the other architectures when the Brute Force algorithm processes high-dimensional contexts with many objects and attributes. Their parallel approach shows performance results up to 18× better than a smarter sequential algorithm called Data-Peeler. Moreover, the Brute Force algorithm running on the CPU + GPU architecture has greater energy efficiency, performing at least 1.79× more operations per unit of energy consumed than the other algorithms and architectures explored in the work.
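To ground what "formal concept extraction by brute force" means, here is a small sequential Python analogue (not the OpenCL kernel, and the context encoding is invented): enumerate every attribute subset, derive its extent (objects having all those attributes), then close it back to an intent; each unique (extent, intent) pair is a formal concept.

```python
from itertools import combinations

# Sequential sketch of brute-force formal concept extraction.
# A context maps each object to the frozenset of attributes it has.
def derive_extent(context, attrs):
    return frozenset(obj for obj, has in context.items() if attrs <= has)

def derive_intent(context, objs):
    if not objs:                                   # empty extent: all attributes
        return frozenset().union(*context.values())
    return frozenset.intersection(*(context[o] for o in objs))

def formal_concepts(context):
    all_attrs = sorted(frozenset().union(*context.values()))
    concepts = set()
    for r in range(len(all_attrs) + 1):
        for subset in combinations(all_attrs, r):
            extent = derive_extent(context, frozenset(subset))
            concepts.add((extent, derive_intent(context, extent)))
    return concepts
```

The enumeration over 2^|attributes| subsets is embarrassingly parallel, which is why a brute-force formulation maps well onto GPU and FPGA work-items despite doing more raw work than a clever sequential algorithm like Data-Peeler.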

In the seventh paper, entitled “Contextual contracts for component-oriented resource abstraction in a cloud of high performance computing services,” Junior et al.7 present HPC Shelf, a cloud computing services platform for building and deploying large-scale parallel computing systems. They introduce Alite, the contextual contract system of HPC Shelf, to select component implementations according to the requirements of the host application, the characteristics of the target parallel computing platform (e.g., clusters and MPPs), quality of service (QoS) properties, and cost restrictions. It is evaluated through a small-scale case study employing two complementary component-based frameworks: the first represents components that implement linear algebra computations based on the BLAS interface, while the second represents parallel computing platforms on the IaaS cloud offered by the Amazon EC2 service.
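The essence of contextual-contract resolution can be sketched in a few lines. Everything here is invented for illustration (the implementation names, contract fields, and selection policy are not Alite's): each implementation advertises a contract, and the platform picks the best candidate satisfying the application's platform and cost constraints.

```python
# Hypothetical contract catalog; names and fields are invented for illustration.
IMPLEMENTATIONS = [
    {"name": "blas-mkl", "platform": "cluster", "cost": 3, "gflops": 900},
    {"name": "blas-ref", "platform": "cluster", "cost": 1, "gflops": 120},
    {"name": "blas-gpu", "platform": "gpu",     "cost": 5, "gflops": 4000},
]

def resolve(platform, max_cost):
    """Pick the highest-performing implementation whose contract satisfies
    the platform and cost constraints, or None if nothing matches."""
    candidates = [i for i in IMPLEMENTATIONS
                  if i["platform"] == platform and i["cost"] <= max_cost]
    return max(candidates, key=lambda i: i["gflops"], default=None)
```

The real system resolves such contracts over richer dimensions (QoS properties, MPP characteristics), but the same match-then-rank shape underlies the idea.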

The last paper in this special issue, “High-performance IO for seismic processing on the cloud,” authored by Guimarães et al.,8 analyzes the main file structures currently used to store seismic data and proposes a new intermediate data structure that improves IO performance while still complying with established standards. They show that, throughout a common workflow in seismic data analysis, the IO performance gain greatly surpasses the overhead of translating data to the intermediate structure. The approach enables a speedup of up to 208 times in reading time compared with classical standards (e.g., SEG-Y), and the intermediate structure is up to 1.8 times more efficient than modern formats (e.g., ASDF). For cache-friendly applications, the speedup over the direct use of SEG-Y reaches 8000 times. They also performed a cost analysis on the AWS cloud showing that HDDs can be 1.25 times more cost-effective than SSDs.
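Much of the gain from an intermediate structure comes from reordering data so a workflow's access pattern becomes sequential. The toy model below is not the paper's format: with a hypothetical SEG-Y-like interleaved layout (header followed by samples, trace after trace), scanning every header drags the sample payload through IO as well, while a columnar intermediate layout touches only the header bytes.

```python
import struct

# Toy layout model; header format and sample count are invented.
HEADER_FMT = "<ii"        # e.g., (inline, crossline) as two int32 fields
N_SAMPLES = 1000          # float32 samples per trace

def interleaved_bytes_per_header_scan(n_traces):
    # Interleaved layout: each header sits next to its 4-byte samples,
    # so a header scan strides over full traces.
    trace_size = struct.calcsize(HEADER_FMT) + 4 * N_SAMPLES
    return n_traces * trace_size

def columnar_bytes_per_header_scan(n_traces):
    # Columnar layout: headers are stored contiguously.
    return n_traces * struct.calcsize(HEADER_FMT)
```

In this toy model the header scan touches roughly 500× fewer bytes in the columnar layout, which illustrates, at a much smaller scale, where speedups of the magnitude reported in the paper come from.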

The research papers presented in this special issue provide insights into fields related to high-performance computing, including performance evaluation, parallel algorithms, and applications in science and engineering. We believe that the main contributions presented in these papers are timely and important, and we hope that readers can benefit from them and contribute to these rapidly growing areas.

Many individuals contributed a great deal of time and energy toward the success of this special issue. We would like to thank all the authors who provided valuable contributions to this special issue. We are also grateful to the reviewers for their many hours of dedicated effort and their valuable feedback to the authors. Finally, we would like to express our gratitude to the Editor-in-Chief of CCPE for his advice, vision, and support in making this special issue possible.



