MultPIM: Fast Stateful Multiplication for Processing-in-Memory,arXiv - CS - Hardware Architecture


MultPIM: Fast Stateful Multiplication for Processing-in-Memory
arXiv - CS - Hardware Architecture. Pub Date: 2021-08-30, DOI: arxiv-2108.13378
Orian Leitersdorf, Ronny Ronen, Shahar Kvatinsky

Processing-in-memory (PIM) seeks to eliminate data transfer between computation and memory by using devices that support both storage and logic. Stateful logic techniques such as IMPLY, MAGIC, and FELIX can perform logic gates within memristive crossbar arrays with massive parallelism. Multiplication via stateful logic is an active field of research due to its wide implications. Recently, RIME became the state-of-the-art algorithm for stateful single-row multiplication by using memristive partitions, reducing the latency of the previous state-of-the-art by 5.1x. In this paper, we begin by proposing novel partition-based computation techniques for broadcasting and shifting data. Then, we design an in-memory multiplication algorithm based on the carry-save add-shift (CSAS) technique. Finally, we detail specific logic optimizations to the algorithm that further reduce latency. These contributions constitute MultPIM, a multiplier that reduces state-of-the-art time complexity from quadratic to linear-log. For 32-bit numbers, MultPIM improves latency by an additional 3.8x over RIME while slightly reducing area overhead. Furthermore, we optimize MultPIM for full-precision matrix-vector multiplication and demonstrate a 22.0x latency improvement over FloatPIM matrix-vector multiplication.
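To make the carry-save add-shift (CSAS) idea named in the abstract concrete, the following is a minimal software sketch of a generic CSAS multiplier. It is not the MultPIM algorithm itself: it does not model memristive partitions, stateful gates, or latency, and the function name `csas_multiply` and its parameters are hypothetical, chosen only for illustration. Each of the N units holds one sum bit and one carry bit. On every step, one bit of `b` is broadcast to all units, each unit adds its partial-product bit to its own carry and to the sum bit shifted in from its higher neighbor, and the lowest unit emits one product bit.

```python
def csas_multiply(a: int, b: int, n_bits: int) -> int:
    """Bit-level model of a generic carry-save add-shift multiplier (illustrative only)."""
    if not (0 <= a < (1 << n_bits) and 0 <= b < (1 << n_bits)):
        raise ValueError("a and b must be unsigned n_bits-wide integers")

    a_bits = [(a >> i) & 1 for i in range(n_bits)]
    s = [0] * n_bits  # sum bit stored in each unit
    c = [0] * n_bits  # carry bit stored in each unit
    product = 0

    # N steps consume the bits of b; N further steps flush the stored sums and carries.
    for step in range(2 * n_bits):
        b_bit = (b >> step) & 1 if step < n_bits else 0  # bit broadcast to all units
        new_s = [0] * n_bits
        new_c = [0] * n_bits
        # In hardware all units would update in parallel; here they are simulated in a loop.
        for i in range(n_bits):
            x = a_bits[i] & b_bit                  # partial-product bit
            y = s[i + 1] if i + 1 < n_bits else 0  # sum shifted in from the higher neighbor
            z = c[i]                               # carry kept in place (carry-save)
            new_s[i] = x ^ y ^ z                   # full-adder sum
            new_c[i] = (x & y) | (x & z) | (y & z) # full-adder carry
        s, c = new_s, new_c
        product |= s[0] << step                    # lowest unit emits one product bit per step
    return product


print(csas_multiply(13, 11, 4))  # 143
```

Because carries are stored locally rather than propagated along the row, each step costs a constant number of full-adder operations per unit, which is the property that makes the scheme attractive for parallel hardware. How MultPIM maps the broadcast and shift steps onto partitions is described in the paper, not in this sketch.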


Updated: 2021-08-31
