Breaking Barriers: Maximizing Array Utilization for Compute In-Memory Fabrics,arXiv - CS - Emerging Technologies


arXiv - CS - Emerging Technologies. Pub Date: 2020-08-15, DOI: arxiv-2008.06741
Brian Crafton, Samuel Spetalnick, Gauthaman Murali, Tushar Krishna, Sung-Kyu Lim, and Arijit Raychowdhury

Compute in-memory (CIM) is a promising technique that minimizes data transport, the primary performance bottleneck and energy cost of most data-intensive applications. CIM has found widespread adoption in accelerating neural networks for machine learning applications. Using a crossbar architecture with emerging non-volatile memories (eNVM) such as dense resistive random access memory (RRAM) or phase change random access memory (PCRAM), various forms of neural networks can be implemented with greatly reduced power and increased on-chip memory capacity. However, compute in-memory faces its own limitations at both the circuit and the device levels. Although compute in-memory using the crossbar architecture can greatly reduce data transport, the rigid nature of these large, fixed weight matrices forfeits the flexibility of traditional CMOS and SRAM-based designs. In this work, we explore the different synchronization barriers that arise from the CIM constraints. Furthermore, we propose a new allocation algorithm and data flow based on input data distributions to maximize utilization and performance for compute in-memory based designs. We demonstrate a 7.47$\times$ performance improvement over a naive allocation method for CIM accelerators on ResNet18.
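The abstract does not spell out the allocation algorithm, so the following is only a hypothetical sketch of why allocation matters when layers in a pipeline must wait on one another. It assumes a simplified model that is not taken from the paper: each layer needs a fixed number of arrays per weight copy, has an estimated per-input cycle count (which, per the abstract, would depend on input data distributions), and speeds up linearly with the number of copies. The names `Layer`, `bottleneck_latency`, and `greedy_allocate` are illustrative, not the authors'.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    arrays: int    # arrays needed to hold one copy of the layer's weights (assumed)
    cycles: float  # estimated cycles per input with a single copy (assumed)


def bottleneck_latency(layers, copies):
    """Per-input latency of the slowest layer, assuming linear speedup per copy."""
    return max(layer.cycles / c for layer, c in zip(layers, copies))


def greedy_allocate(layers, total_arrays):
    """Give spare arrays to whichever layer is currently the bottleneck.

    Illustrative heuristic only: start with one copy of every layer, then
    repeatedly duplicate the slowest layer until its next copy no longer fits.
    """
    copies = [1] * len(layers)
    free = total_arrays - sum(layer.arrays for layer in layers)
    if free < 0:
        raise ValueError("not enough arrays for one copy of each layer")
    while True:
        # Index of the layer with the highest per-input latency right now.
        slowest = max(range(len(layers)),
                      key=lambda j: layers[j].cycles / copies[j])
        if layers[slowest].arrays > free:
            break  # the bottleneck cannot be duplicated again
        copies[slowest] += 1
        free -= layers[slowest].arrays
    return copies


# Made-up numbers, purely for illustration.
layers = [Layer("conv1", 2, 400.0), Layer("conv2", 4, 100.0), Layer("fc", 1, 50.0)]
naive = [1, 1, 1]
balanced = greedy_allocate(layers, total_arrays=13)
speedup = bottleneck_latency(layers, naive) / bottleneck_latency(layers, balanced)
```

In this toy model the spare arrays all go to the slowest layer, so every other layer spends less time waiting at the barrier. A real design would replace the linear-speedup assumption with measured or simulated cycle counts.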


Updated: 2020-08-18
