Acceleration of Parallel-Blocked QR Decomposition of Tall-and-Skinny Matrices on FPGAs, ACM Transactions on Architecture and Code Optimization


Acceleration of Parallel-Blocked QR Decomposition of Tall-and-Skinny Matrices on FPGAs
ACM Transactions on Architecture and Code Optimization (IF 1.6), Pub Date: 2021-05-10, DOI: 10.1145/3447775
Jose M. Rodriguez Borbon, Junjie Huang, Bryan M. Wong, Walid Najjar

QR decomposition is one of the most useful factorization kernels in modern numerical linear algebra algorithms. In particular, the decomposition of tall-and-skinny matrices (TSMs) has major applications in areas including scientific computing, machine learning, image processing, wireless networks, and numerical methods. Traditionally, CPUs and GPUs have achieved better throughput on these applications by using large cache hierarchies and compute cores running at a high frequency, leading to high power consumption. With the advent of heterogeneous platforms, however, FPGAs are emerging as a promising alternative. In this work, we propose a high-throughput FPGA-based engine that has a very high computational efficiency (ratio of achieved to peak throughput) compared to similar QR solvers running on FPGAs. Whereas comparable QR solvers achieve an efficiency of 36%, our design exhibits an efficiency of 54%. For TSMs, our experimental results show that our design can outperform highly optimized QR solvers running on CPUs and GPUs. For TSMs with more than 50K rows, our design outperforms the Intel MKL solver running on an Intel quad-core processor by a factor of 1.5×. For TSMs containing 256 columns or fewer, our design outperforms the NVIDIA CUBLAS solver running on a K40 GPU by a factor of 3.0×. In addition to being fast, our design is energy efficient: competing platforms execute up to 0.6 GFLOPS/Joule, whereas our design executes more than 1.0 GFLOPS/Joule.
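The abstract does not describe the engine's internals, so the following is only a generic illustration of the idea behind a blocked QR of a tall-and-skinny matrix, not the authors' FPGA design. It is a minimal sketch that assumes a one-level row blocking and uses Givens rotations for the small factorizations; the function names `givens_r` and `tsqr_r` and the `block_rows` parameter are hypothetical. Each row block is reduced to a small R factor independently (the part that can run in parallel), and the stacked R factors are then reduced once more.

```python
import math


def givens_r(rows):
    """Return the upper-triangular R factor of a dense matrix given as a list of rows."""
    a = [[float(x) for x in row] for row in rows]
    m, n = len(a), len(a[0])
    for j in range(min(m, n)):
        for i in range(j + 1, m):
            if a[i][j] == 0.0:
                continue  # entry is already zero, no rotation needed
            # Rotate rows j and i so that a[i][j] becomes zero.
            r = math.hypot(a[j][j], a[i][j])
            c, s = a[j][j] / r, a[i][j] / r
            for k in range(j, n):
                x, y = a[j][k], a[i][k]
                a[j][k] = c * x + s * y
                a[i][k] = c * y - s * x
    # Rows below the first min(m, n) are zero after the rotations.
    return a[:min(m, n)]


def tsqr_r(rows, block_rows):
    """R factor of a tall-and-skinny matrix via one level of row blocking.

    Illustrative assumption: a single reduction level; real implementations
    often combine the per-block R factors in a tree.
    """
    stacked = []
    for start in range(0, len(rows), block_rows):
        # Each block is factored independently of the others.
        stacked.extend(givens_r(rows[start:start + block_rows]))
    # Factor the stack of small R factors to obtain the final R.
    return givens_r(stacked)
```

Because R is unique only up to the signs of its rows, a convenient sanity check is that the Gram matrix is preserved, that is, RᵀR equals AᵀA up to rounding.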


Updated: 2021-05-10
