Accelerating Stochastic Gradient Descent Based Matrix Factorization on FPGA
IEEE Transactions on Parallel and Distributed Systems (IF 5.6), Pub Date: 2020-08-01, DOI: 10.1109/tpds.2020.2974744
Shijie Zhou, Rajgopal Kannan, Viktor K. Prasanna

Matrix Factorization (MF) based on Stochastic Gradient Descent (SGD) is a powerful machine learning technique for deriving hidden features of objects from observations. In this article, we design a highly parallel architecture based on a Field-Programmable Gate Array (FPGA) to accelerate the training process of the SGD-based MF algorithm. We identify the challenges in accelerating this algorithm and propose novel algorithmic optimizations to overcome them. By transforming the SGD-based MF algorithm into a bipartite graph processing problem, we propose a 3-level hierarchical partitioning scheme that enables conflict-minimizing scheduling and processing of edges to achieve significant speedup. First, we develop a fast heuristic graph partitioning approach to partition the bipartite graph into induced subgraphs; this enables efficient use of the FPGA's on-chip memory resources for data reuse and completely hides the data communication between the FPGA and external memory. Second, we partition all the edges of each subgraph into non-overlapping matchings to extract maximum parallelism. Third, we propose a batching algorithm to schedule the execution of the edges inside each matching to reduce memory access conflicts on the FPGA's on-chip RAMs. Compared with non-optimized FPGA-based baseline designs, the proposed optimizations result in up to 60× data dependency reduction, 4.2× bank conflict reduction, and 15.4× speedup. We evaluate the performance of our design using a state-of-the-art FPGA device. Experimental results show that our FPGA accelerator sustains a high computing throughput of up to 217 GFLOPS (billion floating-point operations per second) for training very large real-life sparse matrices. Compared with highly optimized GPU-based accelerators, our FPGA accelerator achieves up to 12.7× speedup. Based on our optimization methodology, we also implement a software-based design on a multi-core platform, which achieves a 1.3× speedup over the state-of-the-art multi-core implementation.
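To make the core computation concrete, below is a minimal Python sketch of the SGD-based MF update viewed as bipartite graph edge processing, together with a greedy partitioning of the observed entries into non-overlapping matchings (no two edges in a matching share a row or column vertex, so their updates touch disjoint latent vectors and are conflict-free). This sketch is illustrative only: the function names (greedy_matchings, sgd_epoch), the learning-rate and regularization values, and the toy data are assumptions and are not taken from the paper, which additionally partitions the graph into induced subgraphs and batches edges for the hardware design.

import numpy as np

def greedy_matchings(edges):
    # Greedily partition edges (i, j, r) into matchings: within a matching,
    # no two edges share a row i or a column j, so their SGD updates are
    # data-dependency-free and could be applied in parallel.
    matchings, remaining = [], list(edges)
    while remaining:
        used_rows, used_cols, matching, rest = set(), set(), [], []
        for (i, j, r) in remaining:
            if i not in used_rows and j not in used_cols:
                matching.append((i, j, r))
                used_rows.add(i)
                used_cols.add(j)
            else:
                rest.append((i, j, r))
        matchings.append(matching)
        remaining = rest
    return matchings

def sgd_epoch(matchings, P, Q, lr=0.01, reg=0.05):
    # One SGD pass over the observed entries, processed matching by matching.
    # Each edge updates only its own row of P and column of Q.
    for matching in matchings:
        for (i, j, r) in matching:
            err = r - P[i] @ Q[j]
            P[i], Q[j] = (P[i] + lr * (err * Q[j] - reg * P[i]),
                          Q[j] + lr * (err * P[i] - reg * Q[j]))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_rows, n_cols, k = 6, 5, 4          # toy problem size (hypothetical)
    edges = [(0, 1, 4.0), (0, 3, 2.0), (2, 1, 5.0), (3, 0, 3.0), (4, 4, 1.0)]
    P = 0.1 * rng.standard_normal((n_rows, k))
    Q = 0.1 * rng.standard_normal((n_cols, k))
    for _ in range(50):
        sgd_epoch(greedy_matchings(edges), P, Q)
    print([round(float(P[i] @ Q[j]), 2) for (i, j, _) in edges])

In the FPGA design described in the abstract, the edges of one matching would be dispatched to parallel processing elements; the sequential inner loop here merely mimics that conflict-free schedule in software.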

Updated: 2020-08-01