Optimizing Weight Mapping and Data Flow for Convolutional Neural Networks on Processing-In-Memory Architectures
IEEE Transactions on Circuits and Systems I: Regular Papers (IF 5.1), Pub Date: 2020-04-01, DOI: 10.1109/tcsi.2019.2958568
Xiaochen Peng, Rui Liu, Shimeng Yu

Recent state-of-the-art deep convolutional neural networks (CNNs) have shown remarkable success in current intelligent systems for various tasks, such as image/speech recognition and classification. A number of recent efforts have attempted to design custom inference engines based on the processing-in-memory (PIM) architecture, where the memory array is used for weighted-sum computation, thereby avoiding the frequent data transfer between buffers and computation units. Prior PIM designs typically unroll each 3D kernel of the convolutional layers into a vertical column of a large weight matrix, so the input data need to be accessed multiple times. In this paper, in order to maximize both weight and input data reuse for the PIM architecture, we propose a novel weight mapping method and the corresponding data flow, which divides the kernels and assigns the input data to different processing elements (PEs) according to their spatial locations. As a case study, a resistive random access memory (RRAM) based 8-bit PIM design at 32 nm is benchmarked. The proposed mapping method and data flow yield a $\sim 2.03\times$ speedup and a $\sim 1.4\times$ improvement in throughput and energy efficiency for ResNet-34, compared with a prior design based on the conventional mapping method. To further optimize hardware performance and throughput, we propose an optimal pipeline architecture; with roughly 50% area overhead, it achieves overall $913\times$ and $1.96\times$ improvements in throughput and energy efficiency, reaching 132,476 FPS and 20.1 TOPS/W, respectively.
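To make the conventional mapping concrete, below is a minimal NumPy sketch of the scheme the abstract describes: each 3D kernel is unrolled into one vertical column of a weight matrix, and the input is flattened im2col-style so that a convolutional layer reduces to one weighted-sum matrix product. All function names and shapes here are illustrative assumptions, not code from the paper; the comment in im2col_patches marks the repeated input access that motivates the proposed mapping.

    import numpy as np

    def unroll_kernels(kernels):
        # Conventional PIM mapping: each 3D kernel (C, K, K) becomes one
        # vertical column of a large 2D weight matrix stored in the array.
        n, c, k, _ = kernels.shape
        return kernels.reshape(n, c * k * k).T           # (C*K*K, n)

    def im2col_patches(fmap, k, stride=1):
        # Flatten every KxK receptive field of the input (C, H, W) into a row.
        # Neighbouring windows overlap, so each input pixel is fetched up to
        # K*K times: the repeated input access the proposed mapping avoids.
        c, h, w = fmap.shape
        out_h = (h - k) // stride + 1
        out_w = (w - k) // stride + 1
        rows = [fmap[:, i:i + k, j:j + k].ravel()
                for i in range(0, out_h * stride, stride)
                for j in range(0, out_w * stride, stride)]
        return np.stack(rows)                            # (out_h*out_w, C*K*K)

    # Toy check: a convolutional layer as one weighted-sum matrix product.
    kernels = np.random.randn(64, 16, 3, 3)   # 64 kernels, 16 input channels
    fmap = np.random.randn(16, 8, 8)           # one 16-channel 8x8 feature map
    out = im2col_patches(fmap, k=3) @ unroll_kernels(kernels)   # (36, 64)

The paper's proposed scheme, by contrast, splits the kernels and assigns input pixels to PEs by spatial location, so that each pixel is delivered once to the array region that consumes it rather than being re-read for every overlapping window.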

Updated: 2020-04-01