A Power Efficiency Enhancements of a Multi-Bit Accelerator for Memory Prohibitive Deep Neural Networks
IEEE Open Journal of Circuits and Systems (IF 2.4) Pub Date: 2021-01-25, DOI: 10.1109/ojcas.2020.3047225
Suhas Shivapakash, Hardik Jain, Olaf Hellwich, Friedel Gerfers

Convolutional Neural Networks (CNNs) are widely employed in contemporary artificial intelligence systems. However, these models have millions of connections between layers, which makes them both memory prohibitive and computationally expensive. Deploying them in embedded mobile applications is resource limited, with high power consumption and a significant bandwidth requirement for accessing data from off-chip DRAM. Reducing the data movement between on-chip memory and off-chip DRAM is the main criterion for achieving high throughput and better overall energy efficiency. Our proposed multi-bit accelerator achieves these goals by truncating the partial-sum (Psum) results of the preceding layer before feeding them into the next layer. We demonstrate the architecture by running inference at 32 bits for the first convolution layer and sequentially truncating bits from the MSB/LSB of the integer and fractional parts, without any further training of the original network. At the last fully connected layer, top-1 accuracy is maintained at a reduced bit width of 14, and top-5 accuracy down to a 10-bit width. The computation engine consists of a systolic array of 1024 processing elements (PEs). Large CNNs such as AlexNet, MobileNet, SqueezeNet, and EfficientNet were used as benchmark models, and a Virtex UltraScale FPGA was used to test the architecture. Compared with the 32-bit architecture, the proposed truncation scheme reduces power by 49% and resource utilization by 73.25% for look-up tables (LUTs), 68.76% for flip-flops (FFs), 74.60% for block RAMs (BRAMs), and 79.425% for digital signal processors (DSPs). The design achieves a performance of 223.69 GOPS on the Virtex UltraScale FPGA, an overall throughput gain of $3.63\times$ compared to prior FPGA accelerators. In addition, the overall power consumption is $4.5\times$ lower than that of prior architectures. The ASIC version of the accelerator was designed in a 22 nm FDSOI CMOS process and achieves an overall energy efficiency of 2.03 TOPS/W with a total power consumption of 791 mW and an area of 1 mm $\times$ 1.2 mm.
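As a concrete illustration of the truncation scheme, the following minimal NumPy sketch drops MSBs from the integer part and LSBs from the fractional part of a fixed-point Psum before it is passed to the next layer. The Q16.16 split, the choice to saturate rather than wrap when integer MSBs are removed, and all identifiers are assumptions for illustration; the abstract specifies only 32-bit Psums truncated down to 14- and 10-bit widths.

```python
import numpy as np

def truncate_psum(psum, int_bits=16, frac_bits=16, msb_drop=8, lsb_drop=10):
    """Truncate a fixed-point partial sum between layers.

    `psum` holds a Q(int_bits).(frac_bits) fixed-point value as an
    integer. `lsb_drop` bits are shifted off the fractional part and
    `msb_drop` bits are removed from the integer part by saturating
    to the smaller range. Hypothetical helper; the parameter defaults
    and the Q16.16 split are assumptions, not taken from the paper.
    """
    out = psum >> lsb_drop  # drop fractional LSBs (arithmetic shift)
    remaining = int_bits + frac_bits - msb_drop - lsb_drop
    lo = -(1 << (remaining - 1))     # saturate instead of wrapping
    hi = (1 << (remaining - 1)) - 1  # when integer MSBs are removed
    return np.clip(out, lo, hi)

# Example: 32-bit Psums reduced to the 14-bit width at which the
# paper reports top-1 accuracy is maintained.
psums = np.array([123456, -98765, 1 << 24], dtype=np.int64)
print(truncate_psum(psums))  # -> [ 120  -97 8191]
```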

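The compute engine itself is described only as a 1024-PE systolic array. As a rough sketch of how such an array accumulates the Psums in the first place, the model below simulates a generic output-stationary systolic matrix multiply at cycle granularity; this is a textbook dataflow, not the authors' documented design.

```python
import numpy as np

def systolic_matmul(a, w):
    """Cycle-level model of an output-stationary systolic array.

    PE (i, j) accumulates output element (i, j); activations flow
    right and weights flow down, each skewed by one cycle per
    row/column, so operand pair k reaches PE (i, j) at cycle
    t = i + j + k. Generic textbook dataflow, not the paper's design.
    """
    M, K = a.shape
    _, N = w.shape
    acc = np.zeros((M, N), dtype=a.dtype)
    for t in range(M + N + K - 2):  # cycles until the last skewed
        for i in range(M):          # operands drain through the array
            for j in range(N):
                k = t - i - j       # operand index arriving at PE (i, j)
                if 0 <= k < K:
                    acc[i, j] += a[i, k] * w[k, j]
    return acc

# Sanity check against a plain matrix multiply.
a = np.random.randint(-8, 8, size=(4, 6))
w = np.random.randint(-8, 8, size=(6, 5))
assert np.array_equal(systolic_matmul(a, w), a @ w)
```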
Updated: 2021-01-26