An on-node scalable sparse incomplete LU factorization for a many-core iterative solver with Javelin
Parallel Computing (IF 1.4), Pub Date: 2020-03-23, DOI: 10.1016/j.parco.2020.102622
Joshua Dennis Booth, Gregory Bolet

We present a scalable incomplete LU factorization to be used as a preconditioner for solving sparse linear systems with iterative methods in the package called Javelin. Javelin allows for improved parallel factorization on shared-memory many-core systems by packaging the coefficient matrix into a format that allows for high-performance sparse matrix-vector multiplication and sparse triangular solves with minimal overhead. The framework achieves these goals by using a collection of traditional permutations, point-to-point thread synchronizations, tasking, and segmented prefix scans in a conventional compressed sparse row (CSR) format. Moreover, this framework stresses the importance of co-designing dependent tasks, such as sparse factorization and triangular solves, on highly threaded architectures. We compare our method to past distributed methods for incomplete factorization (Aztec) and current multithreaded packages (WSMP) in order to demonstrate the importance of having highly threaded factorizations on many-core systems. With these changes, traditional fill-in and drop-tolerance methods can still be used while achieving observed speedups of up to ~42× on 68 Intel Knights Landing cores and ~12× on 14 Intel Haswell cores. Moreover, this work provides insight into how the new data structure impacts iteration counts, and into future improvements, such as porting to GPUs.
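
For readers unfamiliar with the two dependent kernels the abstract mentions (incomplete factorization and the triangular solves that apply it), the following is a minimal serial sketch in Python. It is not Javelin's implementation: it assumes the zero fill-in variant ILU(0), a CSR matrix with sorted column indices and a stored nonzero diagonal, and it omits the permutations, threading, and synchronization the paper is about. The function names `ilu0_csr` and `ilu_solve` are hypothetical.

```python
def ilu0_csr(n, indptr, indices, data):
    """ILU(0) on a CSR matrix (assumes sorted columns and a stored diagonal).

    Returns (lu, diag): lu reuses A's sparsity pattern, with the strictly
    lower part holding L (unit diagonal implied) and the rest holding U;
    diag[i] is the position of the diagonal entry of row i.
    """
    lu = list(data)
    diag = [-1] * n
    for i in range(n):
        for p in range(indptr[i], indptr[i + 1]):
            if indices[p] == i:
                diag[i] = p
        if diag[i] < 0:
            raise ValueError(f"row {i} has no stored diagonal entry")

    for i in range(n):
        # Column -> position lookup for row i, so updates stay on the pattern.
        pos = {indices[p]: p for p in range(indptr[i], indptr[i + 1])}
        for p in range(indptr[i], indptr[i + 1]):
            k = indices[p]
            if k >= i:
                break  # only columns left of the diagonal eliminate
            lu[p] /= lu[diag[k]]  # multiplier l_ik
            # Subtract l_ik * u_kj where (i, j) already exists; drop the rest.
            for q in range(diag[k] + 1, indptr[k + 1]):
                t = pos.get(indices[q])
                if t is not None:
                    lu[t] -= lu[p] * lu[q]
    return lu, diag


def ilu_solve(n, indptr, indices, lu, diag, b):
    """Apply the preconditioner: solve L y = b, then U x = y."""
    x = list(b)
    for i in range(n):  # forward substitution, unit lower triangle
        for p in range(indptr[i], diag[i]):
            x[i] -= lu[p] * x[indices[p]]
    for i in range(n - 1, -1, -1):  # backward substitution
        for p in range(diag[i] + 1, indptr[i + 1]):
            x[i] -= lu[p] * x[indices[p]]
        x[i] /= lu[diag[i]]
    return x


# Usage: a 3x3 tridiagonal matrix, for which ILU(0) coincides with exact LU.
indptr = [0, 2, 5, 7]
indices = [0, 1, 0, 1, 2, 1, 2]
data = [4.0, -1.0, -1.0, 4.0, -1.0, -1.0, 4.0]
b = [2.0, 4.0, 10.0]  # equals A @ [1, 2, 3]

lu, diag = ilu0_csr(3, indptr, indices, data)
x = ilu_solve(3, indptr, indices, lu, diag, b)
```

In both routines, row `i` reads values produced by earlier rows. That dependency chain is what makes the factorization and the triangular solves hard to parallelize, and it is the reason the abstract argues for co-designing the two.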




Updated: 2020-03-23