cuTensor-tubal: Efficient Primitives for Tubal-rank Tensor Operations on GPUs
IEEE Transactions on Parallel and Distributed Systems (IF 5.3), Pub Date: 2020-03-01, DOI: 10.1109/tpds.2019.2940192
Tao Zhang, Xiao-Yang Liu, Xiaodong Wang, Anwar Walid

Tensors are the cornerstone data structures in high-performance computing, big data analysis, and machine learning. However, tensor computations are compute-intensive, and the running time increases rapidly with the tensor size. Therefore, designing high-performance primitives on parallel architectures such as GPUs is critical for meeting ever-growing data processing demands efficiently. Existing GPU basic linear algebra subroutine (BLAS) libraries (e.g., NVIDIA cuBLAS) do not provide tensor primitives, so researchers have to implement and optimize their own tensor algorithms in a case-by-case manner, which is inefficient and error-prone. In this paper, we develop the cuTensor-tubal library of seven key primitives for the tubal-rank tensor model on GPUs: t-FFT, inverse t-FFT, t-product, t-SVD, t-QR, t-inverse, and t-normalization. cuTensor-tubal adopts a frequency-domain computation scheme to expose the separability in the frequency domain, then maps the tube-wise and slice-wise parallelism onto the single instruction multiple thread (SIMT) GPU architecture. To achieve good performance, we optimize data transfers and memory accesses, and design batched and streamed parallelization schemes for tensor operations with data-independent and data-dependent computation patterns, respectively. In the evaluations of t-product, t-SVD, t-QR, t-inverse, and t-normalization, cuTensor-tubal achieves maximum speedups of 16.91×, 27.03×, 38.97×, 22.36×, and 15.43×, respectively, over CPU implementations running on dual 10-core Xeon CPUs. Two applications, namely t-SVD-based video compression and low-tubal-rank tensor completion, are tested using our library and achieve maximum speedups of 9.80× and 269.26× over multi-core CPU implementations.
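To make the frequency-domain scheme concrete, the following is a minimal NumPy sketch of how a t-product decomposes into a t-FFT along the tubes, independent per-slice matrix products, and an inverse t-FFT. This is an illustrative assumption of the standard tubal-tensor construction, not the cuTensor-tubal API itself; the function name t_product and the tensor shapes are hypothetical, and on a GPU the per-slice products would map onto batched kernels as the abstract describes.

import numpy as np

def t_product(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Hypothetical sketch of the t-product of A (n1 x n2 x n3) and B (n2 x n4 x n3)."""
    n1, n2, n3 = A.shape
    _, n4, _ = B.shape
    assert B.shape == (n2, n4, n3)

    # t-FFT: Fourier transform each tube (fiber along the third dimension).
    A_hat = np.fft.fft(A, axis=2)
    B_hat = np.fft.fft(B, axis=2)

    # Slice-wise parallelism: each frontal slice in the frequency domain is an
    # independent matrix product; on a GPU these become batched/streamed work.
    C_hat = np.empty((n1, n4, n3), dtype=complex)
    for k in range(n3):
        C_hat[:, :, k] = A_hat[:, :, k] @ B_hat[:, :, k]

    # Inverse t-FFT brings the result back to the original domain.
    return np.real(np.fft.ifft(C_hat, axis=2))

# Example usage with small random tensors.
A = np.random.rand(4, 3, 5)
B = np.random.rand(3, 2, 5)
C = t_product(A, B)  # C has shape (4, 2, 5)

Primitives such as t-SVD, t-QR, and t-inverse follow the same pattern: transform along the tubes, apply the corresponding matrix factorization to each frontal slice in the frequency domain, then transform back.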

Updated: 2020-03-01