CLU
ACM Journal on Emerging Technologies in Computing Systems (IF 2.1), Pub Date: 2021-04-15, DOI: 10.1145/3427472
Palash Das, Hemangee K. Kapoor

Convolutional/Deep Neural Networks (CNNs/DNNs) are rapidly growing workloads for emerging AI-based systems. The gap between processing speed and memory-access latency in multi-core systems degrades the performance and energy efficiency of CNN/DNN tasks. This article aims to narrow this gap with a simple yet efficient near-memory accelerator-based system that expedites CNN inference. Toward this goal, we first design an efficient parallel algorithm to accelerate CNN/DNN tasks. The data is partitioned across multiple memory channels (vaults) to support the execution of the parallel algorithm. Second, we design a hardware unit, the convolutional logic unit (CLU), which implements the parallel algorithm. To optimize inference, the CLU works in three phases for layer-wise processing of data. Last, to harness the benefits of near-memory processing (NMP), we integrate homogeneous CLUs on the logic layer of a 3D memory, specifically the Hybrid Memory Cube (HMC). The combined effect of these design choices is a high-performing and energy-efficient system for CNNs/DNNs. The proposed system achieves substantial performance gains and energy reduction compared to multi-core CPU- and GPU-based systems, with a minimal area overhead of 2.37%.
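
The abstract does not spell out how data is partitioned across vaults, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes a round-robin split of convolution filters across a hypothetical `num_vaults` count, with each vault-local unit computing its share of a layer independently. The names `partition_filters` and `run_layer_partitioned` are illustrative and do not come from the paper.

```python
from typing import List

Matrix = List[List[float]]


def conv2d_valid(image: Matrix, kernel: Matrix) -> Matrix:
    """Plain 2D 'valid' cross-correlation of one input channel with one kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def partition_filters(num_filters: int, num_vaults: int) -> List[List[int]]:
    """Assign filter indices to vaults round-robin (an assumed policy)."""
    return [list(range(v, num_filters, num_vaults)) for v in range(num_vaults)]


def run_layer_partitioned(image: Matrix, kernels: List[Matrix],
                          num_vaults: int) -> List[Matrix]:
    """Compute one conv layer, with each vault handling only its own filters."""
    assignment = partition_filters(len(kernels), num_vaults)
    outputs: List[Matrix] = [[] for _ in kernels]
    # Each pass of the outer loop stands in for one vault-local unit;
    # the passes share no state, so they could run in parallel.
    for vault_filters in assignment:
        for f in vault_filters:
            outputs[f] = conv2d_valid(image, kernels[f])
    return outputs
```

Because the per-vault work is independent, the partitioned result matches a direct filter-by-filter computation; only the placement of the work changes.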

Updated: 2021-04-15