ParaML: A Polyvalent Multi-core Accelerator for Machine Learning
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (IF 2.9). Pub Date: 2020-09-01. DOI: 10.1109/tcad.2019.2927523
Shengyuan Zhou, Qi Guo, Zidong Du, Daofu Liu, Tianshi Chen, Ling Li, Shaoli Liu, Jinhong Zhou, Olivier Temam, Xiaobing Feng, Xuehai Zhou, Yunji Chen

In recent years, machine learning (ML) techniques have proven to be powerful tools in various emerging applications. Traditionally, ML workloads run on general-purpose CPUs and GPUs, but the energy efficiency of these processors is limited by their excessive support for flexibility. Hardware accelerators are an efficient alternative to CPUs/GPUs, yet they remain limited because each typically accommodates only a single ML technique (or family of techniques). Since different problems may require different ML techniques, such single-purpose accelerators may achieve poor learning accuracy or even be ineffective. In this paper, we present ParaML, a polyvalent accelerator architecture integrating multiple processing cores, which accommodates ten representative ML techniques: $k$-means, $k$-nearest neighbors ($k$-NN), naive Bayes (NB), support vector machine (SVM), linear regression (LR), classification tree (CT), deep neural network (DNN), learning vector quantization (LVQ), Parzen window (PW), and principal component analysis (PCA). Benefiting from our thorough analysis of the computational primitives and locality properties of these ML techniques, the single-core ParaML performs up to 1056 GOP/s (e.g., additions and multiplications) in an area of 3.51 mm$^2$ while consuming only 596 mW, as estimated on the post-synthesis netlist by ICC (area) and PrimeTime PX (power). Compared with the NVIDIA K20M GPU (28-nm process), the single-core ParaML (65-nm process) is $1.21\times$ faster and reduces energy by $137.93\times$. We also compare the single-core ParaML with other accelerators. Compared with PRINS, the single-core ParaML achieves $72.09\times$ and $2.57\times$ energy benefits for $k$-NN and $k$-means, respectively, and speeds up each $k$-NN query by $44.76\times$. Compared with EIE, the single-core ParaML achieves a $5.02\times$ speedup and $4.97\times$ energy benefit with $11.62\times$ less area when evaluated on a dense DNN. Compared with the TPU, the single-core ParaML achieves $2.45\times$ better power efficiency (5647 GOP/W versus 2300 GOP/W) with $321.36\times$ less area. Compared to the single-core version, the 8-core ParaML further improves the speedup, up to $3.98\times$, with an area of 13.44 mm$^2$ and a power of 2036 mW.
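The abstract's key premise is that these ten techniques decompose into a small set of shared computational primitives. As a rough, self-contained illustration of that observation (our own NumPy sketch, not the authors' code or the ParaML datapath; all function names and the learning rate are illustrative assumptions), the snippet below shows three of the supported techniques ($k$-NN, $k$-means, and LVQ) all bottlenecking on the same multiply-accumulate distance kernel:

```python
import numpy as np

def squared_l2(x, y):
    # Shared primitive: a multiply-accumulate reduction over a vector pair.
    d = x - y
    return float(np.dot(d, d))

def knn_predict(query, points, labels, k=3):
    # k-NN: majority label among the k points closest to the query.
    # labels is assumed to be an integer array (for np.bincount).
    dists = [squared_l2(query, p) for p in points]
    nearest = np.argsort(dists)[:k]
    return int(np.bincount(labels[nearest]).argmax())

def kmeans_assign(x, centroids):
    # k-means assignment step: index of the nearest centroid.
    return int(np.argmin([squared_l2(x, c) for c in centroids]))

def lvq_update(x, prototypes, proto_labels, label, lr=0.1):
    # LVQ: pull the winning prototype toward a same-label sample,
    # push it away otherwise.
    w = int(np.argmin([squared_l2(x, p) for p in prototypes]))
    sign = 1.0 if proto_labels[w] == label else -1.0
    prototypes[w] += sign * lr * (x - prototypes[w])
    return w

# Example: all three kernels spend their cycles in squared_l2.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = rng.integers(0, 3, size=100)
C = rng.normal(size=(3, 8))
q = rng.normal(size=8)
print(knn_predict(q, X, y), kmeans_assign(q, C))
```

In hardware terms, it is this shared inner loop that a single polyvalent datapath can serve across techniques, which is the flexibility-versus-efficiency trade-off the abstract describes.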
