FPGA Logic Block Architectures for Efficient Deep Learning Inference
ACM Transactions on Reconfigurable Technology and Systems (IF 3.1), Pub Date: 2020-06-03, DOI: 10.1145/3393668
Mohamed Eldafrawy, Andrew Boutros, Sadegh Yazdanshenas, Vaughn Betz

Reducing the precision of deep neural network (DNN) inference accelerators can yield large efficiency gains with little or no accuracy degradation compared to half- or single-precision floating-point, by enabling more multiplication operations per unit area. A wide range of precisions fall on the Pareto-optimal curve of hardware efficiency vs. accuracy, with no single precision dominating, which makes the variable-precision capabilities of FPGAs very valuable. We propose three types of logic block architectural enhancements and fully evaluate a total of six architectures that improve the area efficiency of multiplications and additions implemented in the soft fabric. Increasing the LUT fracturability and adding two adders to the ALM (the 4-bit Adder Double Chain architecture) leads to a 1.5× area reduction for arithmetic-heavy machine learning (ML) kernels while increasing their speed. This architecture also reduces the logic area of general applications by 6%, while increasing the critical path delay by only 1%. Our highest-impact option, which adds a 9-bit shadow multiplier to the logic clusters, reduces the area and critical path delay of ML kernels by 2.4× and 1.2×, respectively. These large gains come at the cost of a 15% logic area increase for general applications.
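
To make the precision/efficiency trade-off concrete: the partial-product array of an n×n-bit soft-logic multiplier grows roughly quadratically with n, so halving operand width roughly quarters multiplier area, while accuracy often degrades only slightly. The Python sketch below (our illustration, not from the paper; the symmetric quantization scheme, vector length, and bit widths are arbitrary assumptions) computes a dot product in n-bit integer arithmetic, the way a reduced-precision FPGA MAC array would, and reports the error relative to the floating-point result.

import numpy as np

def quantize(x, n_bits):
    # Symmetric uniform quantization to signed n-bit integers
    # (an assumed scheme chosen for simplicity of illustration).
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int64)
    return q, scale

def quantized_dot(w, a, n_bits):
    # n-bit integer multiplies accumulated in a wide register, as a
    # hardware accumulator would; one floating-point rescale at the
    # end recovers the real-valued result.
    qw, sw = quantize(w, n_bits)
    qa, sa = quantize(a, n_bits)
    return int(np.dot(qw, qa)) * sw * sa

rng = np.random.default_rng(0)
w, a = rng.standard_normal(256), rng.standard_normal(256)
exact = float(np.dot(w, a))
for n_bits in (16, 8, 4):
    approx = quantized_dot(w, a, n_bits)
    print(f"{n_bits:2d}-bit dot product, relative error "
          f"{abs(approx - exact) / abs(exact):.2e}")

Running the sketch shows the error shrinking as precision grows, with no single width best for every accuracy target, which is the Pareto trade-off that makes variable-precision FPGA fabrics attractive.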
