Energy-Efficient Accelerator Design With Tile-Based Row-Independent Compressed Memory for Sparse Compressed Convolutional Neural Networks
IEEE Open Journal of Circuits and Systems Pub Date : 2021-01-25 , DOI: 10.1109/ojcas.2020.3041685
Po-Tsang Huang , I-Chen Wu , Chin-Yang Lo , Wei Hwang

Deep convolutional neural networks (CNNs) are difficult to deploy fully on edge devices because their workloads are both memory-intensive and computation-intensive. The energy efficiency of CNNs is dominated by convolution computation and off-chip memory (DRAM) accesses, with DRAM accesses being the most costly. In this article, an energy-efficient accelerator is proposed for sparse compressed CNNs that reduces DRAM accesses and eliminates zero-operand computation. Weight compression is applied to sparse compressed CNNs to reduce the required memory capacity/bandwidth and prune a large portion of connections. To this end, a tile-based row-independent compression (TRC) method with relative indexing memory is adopted to store the non-zero terms. Additionally, workloads are distributed across channels to increase the degree of task parallelism, and all-row-to-all-row non-zero element multiplication is adopted to skip redundant computation. Simulation results show that, compared with a dense accelerator, the proposed accelerator achieves a $1.79\times$ speedup and reduces on-chip memory size, energy, and DRAM accesses by 23.51%, 69.53%, and 88.67%, respectively, for VGG-16.
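The core storage idea can be illustrated in software. The sketch below shows one plausible reading of row-independent compression with relative indexing: each tile row is encoded on its own as (gap, value) pairs, where the gap is the distance from the previous non-zero, and a zero-valued placeholder is emitted when a gap overflows the index field. The tile width, 4-bit index field, and placeholder convention are illustrative assumptions, not the paper's actual parameters; the zero-skipping dot product only touches stored non-zeros, mirroring the eliminated zero-operand computation.

```python
# Illustrative sketch of row-independent compression with relative
# indexing (assumed 4-bit index field; not the paper's exact format).

def trc_compress_row(row, index_bits=4):
    """Compress one tile row independently into (gap, value) pairs.

    gap = number of zeros since the previous non-zero. When a gap
    exceeds the index range, a zero-valued placeholder is emitted so
    subsequent indices stay relative."""
    max_gap = (1 << index_bits) - 1
    pairs = []
    last = -1  # position of the previous stored element
    for i, v in enumerate(row):
        if v != 0:
            gap = i - last - 1
            while gap > max_gap:
                pairs.append((max_gap, 0))  # placeholder entry
                gap -= max_gap + 1
            pairs.append((gap, v))
            last = i
    return pairs

def trc_decompress_row(pairs, width):
    """Rebuild the dense row from (gap, value) pairs."""
    row = [0] * width
    pos = -1
    for gap, v in pairs:
        pos += gap + 1
        if v != 0:
            row[pos] = v
    return row

def zero_skip_dot(pairs, activations):
    """Dot product that multiplies only the stored non-zero weights,
    skipping all zero-operand multiplications."""
    acc = 0
    pos = -1
    for gap, v in pairs:
        pos += gap + 1
        if v != 0:
            acc += v * activations[pos]
    return acc
```

Because every row is compressed independently, rows of a tile can be decoded and consumed in parallel without cross-row index dependencies, which is what enables the channel-wise workload distribution described above.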

Updated: 2021-01-26