cuFINUFFT: a load-balanced GPU library for general-purpose nonuniform FFTs
arXiv - CS - Mathematical Software Pub Date : 2021-02-16 , DOI: arxiv-2102.08463
Yu-hsuan Shih, Garrett Wright, Joakim Andén, Johannes Blaschke, Alex H. Barnett

Nonuniform fast Fourier transforms dominate the computational cost in many applications including image reconstruction and signal processing. We thus present a general-purpose GPU-based CUDA library for type 1 (nonuniform to uniform) and type 2 (uniform to nonuniform) transforms in dimensions 2 and 3, in single or double precision. It achieves high performance for a given user-requested accuracy, regardless of the distribution of nonuniform points, via cache-aware point reordering, and load-balanced blocked spreading in shared memory. At low accuracies, this gives on-GPU throughputs around $10^9$ nonuniform points per second, and (even including host-device transfer) is typically 4-10$\times$ faster than the latest parallel CPU code FINUFFT (at 28 threads). It is competitive with two established GPU codes, being up to 90$\times$ faster at high accuracy and/or type 1 clustered point distributions. Finally we demonstrate a 6-18$\times$ speedup versus CPU in an X-ray diffraction 3D iterative reconstruction task at $10^{-12}$ accuracy, observing excellent multi-GPU weak scaling up to one rank per GPU.
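
To make the two transform types named in the abstract concrete, here is a minimal sketch of the sums they approximate, written as slow direct summation in two dimensions. It is not the cuFINUFFT API and uses none of the library's spreading, reordering, or FFT steps. The function names, the `isign` parameter, and the zero-centered ordering of the integer modes are assumptions chosen for illustration.

```python
import cmath

def modes(n):
    """n consecutive integer frequencies centered on zero, e.g. [-2, -1, 0, 1] for n = 4."""
    return range(-(n // 2), (n + 1) // 2)

def nudft2d_type1(x, y, c, n1, n2, isign=1):
    """Type 1 (nonuniform to uniform): strengths c[j] at points (x[j], y[j])
    are summed into an n1-by-n2 grid of Fourier mode coefficients."""
    return [[sum(cj * cmath.exp(1j * isign * (k1 * xj + k2 * yj))
                 for xj, yj, cj in zip(x, y, c))
             for k2 in modes(n2)]
            for k1 in modes(n1)]

def nudft2d_type2(x, y, f, isign=1):
    """Type 2 (uniform to nonuniform): the Fourier series with coefficients
    f[a][b] is evaluated at each nonuniform point (x[j], y[j])."""
    n1, n2 = len(f), len(f[0])
    return [sum(f[a][b] * cmath.exp(1j * isign * (k1 * xj + k2 * yj))
                for a, k1 in enumerate(modes(n1))
                for b, k2 in enumerate(modes(n2)))
            for xj, yj in zip(x, y)]

# Hypothetical sample data: three nonuniform points with complex strengths.
x, y = [0.3, -1.2, 2.5], [1.1, 0.4, -2.0]
c = [1 + 2j, -0.5j, 0.7]
f = nudft2d_type1(x, y, c, 4, 3, isign=-1)   # 4-by-3 grid of modes
vals = nudft2d_type2(x, y, f, isign=1)       # values back at the three points
```

For M points and N modes in total, each direct sum costs O(MN) operations. A fast transform approximates the same sums to a requested accuracy at much lower cost, which is where the throughput figures quoted above apply. With opposite signs, the two functions are adjoints of each other, not inverses, so `vals` is in general not equal to `c`.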

Updated: 2021-02-18