Fast and Efficient Convolutional Accelerator for Edge Computing
IEEE Transactions on Computers (IF 3.6), Pub Date: 2020-01-01, DOI: 10.1109/tc.2019.2941875
Arash Ardakani, Carlo Condo, Warren J. Gross

Convolutional neural networks (CNNs) are a vital approach in machine learning. However, their high complexity and energy consumption make them challenging to embed in edge mobile devices such as smartphones, which require real-time processing. To meet the real-time constraints of edge devices, recently proposed custom hardware CNN accelerators exploit parallel processing elements (PEs) to increase throughput. However, this straightforward parallelization of PEs requires high memory bandwidth and data movement, leading to large energy consumption. As a result, only a limited number of PEs can be instantiated when designing bandwidth-limited custom accelerators targeting edge devices. While most bandwidth-limited designs claim a peak performance of a few hundred giga operations per second, their average runtime performance falls substantially below their roofline when applied to state-of-the-art CNNs such as AlexNet, VGGNet, and ResNet, as a result of low resource utilization and low arithmetic intensity. In this work, we propose a zero-activation-skipping convolutional accelerator (ZASCA) that avoids noncontributory multiplications with zero-valued activations. ZASCA employs a dataflow that minimizes the gap between its average and peak performance while maximizing its arithmetic intensity for both sparse and dense representations of activations, targeting the bandwidth-limited edge-computing scenario. More precisely, ZASCA achieves a performance efficiency of up to 94 percent over a set of state-of-the-art CNNs for image classification with dense representations, where performance efficiency is the ratio of average runtime performance to peak performance. Using its zero-skipping feature, ZASCA can further improve the performance efficiency of these CNNs by up to 1.9×, depending on the degree of activation sparsity. Implementation results in 65-nm TSMC CMOS technology show that, compared to the most energy-efficient prior accelerator, ZASCA processes convolutions 5.5× to 17.5× faster and is between 2.1× and 4.5× more energy efficient while occupying 2.1× less silicon area.
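The two quantitative ideas in the abstract, skipping multiplications for zero-valued activations and measuring performance efficiency as the ratio of average runtime performance to peak performance, can be illustrated with a small sketch. The snippet below is not the authors' hardware design or dataflow; it is a minimal software-level illustration under assumed names (conv2d_zero_skip, perf_efficiency are hypothetical helpers) of how zero activations, such as those produced by ReLU, let a convolution drop MACs, and of how the 94 percent efficiency figure is defined.

```python
# Minimal sketch (not the ZASCA hardware): zero-activation skipping in a
# 2-D convolution and the performance-efficiency metric from the abstract.
import numpy as np

def conv2d_zero_skip(activations, weights):
    """Valid 2-D convolution that skips MACs whose activation is zero.

    activations: (H, W) input feature map (e.g., post-ReLU, often sparse)
    weights:     (K, K) kernel
    Returns the output map and the number of MACs actually performed.
    """
    H, W = activations.shape
    K, _ = weights.shape
    out = np.zeros((H - K + 1, W - K + 1))
    macs_done = 0
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            acc = 0.0
            for ki in range(K):
                for kj in range(K):
                    a = activations[i + ki, j + kj]
                    if a == 0.0:          # zero activation: skip the multiply
                        continue
                    acc += a * weights[ki, kj]
                    macs_done += 1
            out[i, j] = acc
    return out, macs_done

def perf_efficiency(avg_runtime_gops, peak_gops):
    """Performance efficiency = average runtime performance / peak performance."""
    return avg_runtime_gops / peak_gops

# Example: a post-ReLU feature map with many zeros performs far fewer MACs
# than the dense worst case.
act = np.maximum(np.random.randn(8, 8), 0.0)   # ReLU zeroes out negatives
ker = np.random.randn(3, 3)
_, macs = conv2d_zero_skip(act, ker)
dense_macs = (8 - 3 + 1) ** 2 * 3 * 3
print(f"MACs performed: {macs} / {dense_macs} (dense)")
print(f"Efficiency example: {perf_efficiency(94, 100):.2f}")  # 0.94, cf. the 94 percent figure
```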

Updated: 2020-01-01