A CMOS-integrated compute-in-memory macro based on resistive random-access memory for AI edge devices
Nature Electronics (IF 33.7), Pub Date: 2020-12-14, DOI: 10.1038/s41928-020-00505-5
Cheng-Xin Xue, Yen-Cheng Chiu, Ta-Wei Liu, Tsung-Yuan Huang, Je-Syu Liu, Ting-Wei Chang, Hui-Yao Kao, Jing-Hong Wang, Shih-Ying Wei, Chun-Ying Lee, Sheng-Po Huang, Je-Min Hung, Shih-Hsih Teng, Wei-Chen Wei, Yi-Ren Chen, Tzu-Hsiang Hsu, Yen-Kai Chen, Yun-Chen Lo, Tai-Hsing Wen, Chung-Chuan Lo, Ren-Shuo Liu, Chih-Cheng Hsieh, Kea-Tiong Tang, Mon-Shu Ho, Chin-Yi Su, Chung-Cheng Chou, Yu-Der Chih, Meng-Fan Chang

The development of small, energy-efficient artificial intelligence edge devices is limited in conventional computing architectures by the need to transfer data between the processor and memory. Non-volatile compute-in-memory (nvCIM) architectures have the potential to overcome such issues, but the development of high-bit-precision configurations required for dot-product operations remains challenging. In particular, input–output parallelism and cell-area limitations, as well as signal margin degradation, computing latency in multibit analogue readout operations and manufacturing challenges, still need to be addressed. Here we report a 2 Mb nvCIM macro (which combines memory cells and related peripheral circuitry) that is based on single-level cell resistive random-access memory devices and is fabricated in a 22 nm complementary metal–oxide–semiconductor foundry process. Compared with previous nvCIM schemes, our macro can perform multibit dot-product operations with increased input–output parallelism, reduced cell-array area, improved accuracy, and reduced computing latency and energy consumption. The macro can, in particular, achieve latencies between 9.2 and 18.3 ns, and energy efficiencies between 146.21 and 36.61 tera-operations per second per watt, for binary and multibit input–weight–output configurations, respectively.
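The multibit dot-product operation the abstract refers to can be illustrated with a short sketch. This is not the paper's circuit: it is a hypothetical software model of the general nvCIM decomposition, in which multibit inputs are applied bit-serially, weights are stored one bit per single-level cell, each binary partial sum corresponds to accumulating bit-line contributions, and the partial sums are combined with shift-and-add. The function name and bit widths are illustrative assumptions.

```python
# Illustrative sketch (not the paper's implementation): a multibit
# dot product decomposed into binary MAC operations, mirroring how
# an SLC-based nvCIM array with bit-serial inputs and bit-sliced
# weights produces the same result via shift-and-add.

def multibit_dot_product(inputs, weights, in_bits=2, w_bits=2):
    """Compute sum(inputs[i] * weights[i]) from binary bit-plane MACs."""
    total = 0
    for ib in range(in_bits):          # bit-serial input cycles
        for wb in range(w_bits):       # weight bit columns (one SLC cell per bit)
            # Binary MAC: count positions where both bits are 1 --
            # analogous to summing bit-line currents in the array.
            partial = sum(((x >> ib) & 1) & ((w >> wb) & 1)
                          for x, w in zip(inputs, weights))
            total += partial << (ib + wb)   # shift-and-add combination
    return total

# 2-bit inputs and weights: the decomposition matches the direct product.
xs = [1, 2, 3, 0]
ws = [3, 1, 2, 2]
assert multibit_dot_product(xs, ws) == sum(x * w for x, w in zip(xs, ws))
```

The decomposition works because x·w = Σ_ib Σ_wb x_ib·w_wb·2^(ib+wb); each inner binary MAC is the kind of operation the analogue array evaluates in parallel, which is why reducing the number and latency of these readout cycles drives the energy-efficiency figures quoted above.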




Updated: 2020-12-14