Performance Optimizations of Recursive Electronic Structure Solvers targeting Multi-Core Architectures (LA-UR-20-26665), arXiv - CS - Performance

arXiv - CS - Performance, Pub Date: 2021-02-17, DOI: arxiv-2102.08505
Adetokunbo A. Adedoyin, Christian F. A. Negre, Jamaludin Mohd-Yusof, Nicolas Bock, Daniel Osei-Kuffuor, Jean-Luc Fattebert, Michael E. Wall, Anders M. N. Niklasson, Susan M. Mniszewski

As we rapidly approach the frontier of ultra-large computing resources, software optimization is becoming of paramount interest to scientific application developers who want to leverage all available on-node computing capabilities efficiently and thereby improve the science-per-watt metric. The scientific application of interest here is the Basic Math Library (BML), which provides a single interface for linear algebra operations frequently used in the Quantum Molecular Dynamics (QMD) community. A single interface implies an abstraction layer, which in turn suggests commonalities in the code base; therefore, any optimization or tuning introduced in the core of the code base can improve the performance of the library as a whole. With that in mind, we survey the entire BML code base and extract common code snippets in the form of micro-kernels. We introduce several optimization strategies into these micro-kernels, including (1) strength reduction, (2) memory alignment for large arrays, (3) Non-Uniform Memory Access (NUMA) aware allocations to enforce data locality, and (4) appropriate thread affinity and bindings to enhance overall multi-threaded performance. After introducing these optimizations, we benchmark the micro-kernels and compare the run time before and after optimization on several target architectures. Finally, we use the results as a guide for propagating the optimization strategies into the BML code base. As a demonstration, we test the efficacy of these optimization strategies by comparing the baseline and optimized versions of the code.
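The first three strategies named in the abstract can be illustrated with a small C sketch. This is not code from BML: the function names (`alloc_matrix_aligned`, `first_touch_zero`, `matvec_baseline`, `matvec_reduced`), the dense row-major matrix-vector product, and the 64-byte alignment are hypothetical choices made only for illustration. The sketch shows strength reduction (hoisting the `i * n` row offset out of the inner loop), aligned allocation of a large array with C11 `aligned_alloc`, and first-touch initialization, which on Linux typically places each page on the NUMA node of the thread that first writes it. The OpenMP pragmas are ignored when the file is compiled without OpenMP support, so the sketch also runs serially.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical alignment: one 64-byte cache line (also one AVX-512 vector). */
#define ALIGNMENT 64

/* Memory alignment: allocate an n x n row-major matrix aligned to ALIGNMENT.
 * C11 aligned_alloc expects the size to be a multiple of the alignment,
 * so the byte count is rounded up. */
double *alloc_matrix_aligned(size_t n)
{
    size_t bytes = n * n * sizeof(double);
    bytes = (bytes + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT;
    if (bytes == 0)
        bytes = ALIGNMENT;
    double *p = aligned_alloc(ALIGNMENT, bytes);
    assert(p == NULL || (uintptr_t)p % ALIGNMENT == 0); /* debug sanity check */
    return p;
}

/* NUMA-aware first touch: under Linux's default policy a page is usually
 * placed on the node of the thread that first writes it. Initializing with
 * the same static schedule as the compute loop keeps each thread's rows local. */
void first_touch_zero(double *a, size_t n)
{
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++)
        for (size_t j = 0; j < n; j++)
            a[(size_t)i * n + j] = 0.0;
}

/* Baseline: y = A x, recomputing the flat index i*n + j on every access. */
void matvec_baseline(const double *a, const double *x, double *y, size_t n)
{
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++) {
        double sum = 0.0;
        for (size_t j = 0; j < n; j++)
            sum += a[(size_t)i * n + j] * x[j];
        y[i] = sum;
    }
}

/* Strength reduction: the multiply i*n is done once per row and replaced by
 * a row pointer, so the inner loop only advances an index. */
void matvec_reduced(const double *a, const double *x, double *y, size_t n)
{
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++) {
        const double *row = a + (size_t)i * n; /* one multiply per row */
        double sum = 0.0;
        for (size_t j = 0; j < n; j++)
            sum += row[j] * x[j];
        y[i] = sum;
    }
}
```

The fourth strategy, thread affinity, is usually set outside the code. With an OpenMP 4.0 or later runtime, for example, `OMP_PLACES=cores OMP_PROC_BIND=close` pins threads to cores so that each thread stays on the NUMA node where its first-touch pages were placed. Optimizing compilers often perform the strength reduction shown here on their own, so its benefit on a given compiler and architecture should be measured rather than assumed.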

Updated: 2021-02-18
