A 7.3 M Output Non-Zeros/J, 11.7 M Output Non-Zeros/GB Reconfigurable Sparse Matrix-Matrix Multiplication Accelerator
IEEE Journal of Solid-State Circuits (IF 5.4), Pub Date: 2020-04-01, DOI: 10.1109/jssc.2019.2960480
Dong-Hyeon Park, Subhankar Pal, Siying Feng, Paul Gao, Jielun Tan, Austin Rovinski, Shaolin Xie, Chun Zhao, Aporva Amarnath, Timothy Wesley, Jonathan Beaumont, Kuan-Yu Chen, Chaitali Chakrabarti, Michael Bedford Taylor, Trevor Mudge, David Blaauw, Hun-Seok Kim, Ronald G. Dreslinski

A sparse matrix–matrix multiplication (SpMM) accelerator with 48 heterogeneous cores and a reconfigurable memory hierarchy is fabricated in 40-nm CMOS. The compute fabric consists of dedicated floating-point multiplication units and general-purpose Arm Cortex-M0 and Cortex-M4 cores. The on-chip memory reconfigures as either scratchpad or cache, depending on the phase of the algorithm. The memory and compute units are interconnected with synthesizable coalescing crossbars for efficient memory access. Compared with a high-end CPU (GPU), the 2.0-mm $\times$ 2.6-mm chip achieves a 12.6$\times$ (8.4$\times$) energy efficiency gain, an 11.7$\times$ (77.6$\times$) off-chip bandwidth efficiency gain, and a 17.1$\times$ (36.9$\times$) compute density gain across a diverse set of synthetic and real-world power-law graph-based sparse matrices.
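
For readers unfamiliar with the workload, the following is a minimal software sketch of sparse matrix-matrix multiplication and of counting output non-zeros, the quantity the headline metrics are normalized by. It uses a generic row-wise accumulation over a dictionary-of-rows layout; the abstract does not describe the dataflow the chip actually uses, so this should not be read as the accelerator's algorithm. The names `spmm` and `count_nonzeros` and the storage format are assumptions made for illustration only.

```python
from collections import defaultdict


def spmm(a, b):
    """Multiply two sparse matrices stored as {row: {col: value}}.

    Illustrative row-wise formulation: for each non-zero a[i][k],
    scale row k of b and accumulate it into row i of the result.
    """
    c = {}
    for i, a_row in a.items():
        acc = defaultdict(float)  # partial sums for output row i
        for k, a_ik in a_row.items():
            for j, b_kj in b.get(k, {}).items():
                acc[j] += a_ik * b_kj
        # Keep only entries that did not cancel to zero.
        row = {j: v for j, v in acc.items() if v != 0.0}
        if row:
            c[i] = row
    return c


def count_nonzeros(m):
    """Number of stored non-zero entries in a {row: {col: value}} matrix."""
    return sum(len(row) for row in m.values())
```

Only non-zero operands are visited, so the work scales with the number of non-zero products rather than with the dense matrix dimensions, which is why throughput for this kind of kernel is naturally reported per output non-zero.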

Updated: 2020-04-01