A Ubiquitous Machine Learning Accelerator with Automatic Parallelization on FPGA
IEEE Transactions on Parallel and Distributed Systems ( IF 5.3 ) Pub Date : 2020-10-01 , DOI: 10.1109/tpds.2020.2990924
Chao Wang , Lei Gong , Xi Li , Xuehai Zhou

Machine learning is widely applied in emerging data-intensive applications and must be optimized and accelerated by powerful engines to process very large-scale data. Recently, instruction-set-based accelerators on Field Programmable Gate Arrays (FPGAs) have become a promising direction for machine learning applications: the customized instructions can be further scheduled to achieve higher instruction-level parallelism. In this article, we design a ubiquitous accelerator with out-of-order automatic parallelization for large-scale data-intensive applications. The accelerator accommodates four representative application classes: clustering algorithms, deep neural networks, genome sequencing, and collaborative filtering. To improve coarse-grained instruction-level parallelism, the accelerator employs an out-of-order scheduling method that enables parallel dataflow computation. We use Colored Petri Net (CPN) tools to analyze the dependencies in the applications, and build a hardware prototype on a real FPGA platform. For clustering applications, the accelerator supports four different algorithms: K-Means, SLINK, PAM, and DBSCAN. For collaborative filtering applications, it accommodates Tanimoto, Euclidean, Cosine, and Pearson correlation as similarity metrics. For deep learning applications, we implement hardware accelerators for both the training and inference processes. Finally, for genome sequencing, we design a hardware accelerator for the BWA-SW algorithm. Experimental results show that the accelerator architecture achieves up to a 25X speedup over Intel processors with affordable hardware cost, low power consumption, and high flexibility.
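The out-of-order scheduling idea described above — issuing any instruction whose dependencies are already satisfied, so that independent instructions run in parallel — can be illustrated with a small sketch. This is not the paper's CPN-based scheduler; it is a minimal level-based dependency scheduler under the assumption that instruction dependencies are given as a DAG, with hypothetical instruction names:

```python
from collections import defaultdict

def schedule_waves(deps):
    """Group instructions into issue waves: an instruction may issue once
    all of its predecessors have issued, so instructions within the same
    wave are mutually independent and can execute in parallel.
    `deps` maps each instruction to the list of instructions it depends on."""
    indegree = {instr: len(preds) for instr, preds in deps.items()}
    successors = defaultdict(list)
    for instr, preds in deps.items():
        for p in preds:
            successors[p].append(instr)
    # Instructions with no unsatisfied dependencies are ready immediately.
    ready = sorted(i for i, n in indegree.items() if n == 0)
    waves = []
    while ready:
        waves.append(ready)
        nxt = []
        for instr in ready:
            for succ in successors[instr]:
                indegree[succ] -= 1
                if indegree[succ] == 0:
                    nxt.append(succ)
        ready = sorted(nxt)
    return waves

# Example (hypothetical instruction stream): two loads are independent,
# the multiply waits on both, and the add waits on the multiply.
deps = {"load_a": [], "load_b": [], "mul": ["load_a", "load_b"], "add": ["mul"]}
print(schedule_waves(deps))  # → [['load_a', 'load_b'], ['mul'], ['add']]
```

Each wave corresponds to one parallel issue slot; the two loads in the first wave are exactly the coarse-grained parallelism the out-of-order scheduler exploits.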
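The four collaborative-filtering similarity metrics named in the abstract have standard textbook definitions; a minimal software sketch of each (not the paper's hardware implementation, and using a common 1/(1+distance) convention to turn Euclidean distance into a similarity) is:

```python
import math

def tanimoto(a, b):
    """Tanimoto similarity: dot(a,b) / (|a|^2 + |b|^2 - dot(a,b))."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sum(x * x for x in a) + sum(y * y for y in b) - dot)

def euclidean_sim(a, b):
    """Euclidean distance mapped to (0, 1] via 1 / (1 + distance)."""
    dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + dist)

def cosine(a, b):
    """Cosine similarity: dot(a,b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def pearson(a, b):
    """Pearson correlation: cosine similarity of mean-centered vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine([x - mean_a for x in a], [y - mean_b for y in b])
```

All four reduce to dot products and norms over user/item rating vectors, which is why a single accelerator datapath can accommodate them by swapping the final combining step.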

Updated: 2020-10-01