Optimizing Tensor Contractions for Embedded Devices with Racetrack and DRAM Memories
ACM Transactions on Embedded Computing Systems (IF 2.8), Pub Date: 2020-07-07, DOI: 10.1145/3396235
Asif Ali Khan 1, Norman A. Rink 1, Fazal Hameed 2, Jeronimo Castrillon 1

Tensor contraction is a fundamental operation in many algorithms, with applications ranging from quantum chemistry and fluid dynamics to image processing and machine learning. The performance of tensor computations depends critically on the efficient use of on-chip and off-chip memories. On low-power embedded devices, efficient management of the memory space becomes even more important in order to meet energy constraints. This work investigates strategies for performance- and energy-efficient tensor contractions on embedded systems that use a racetrack memory (RTM)-based scratch-pad memory (SPM) and a DRAM-based off-chip memory. Compiler optimizations, such as loop access order and data layout transformations, are paired with architectural optimizations, such as prefetching and preshifting, to reduce the shifting overhead in RTMs. For the off-chip memory, the memory access order, the data mapping, and the choice of a suitable memory access granularity are optimized to reduce contention. Experimental results demonstrate that the proposed optimizations improve SPM performance and energy consumption by 32% and 73%, respectively, compared to an iso-capacity SRAM. The memory optimizations improve the overall DRAM dynamic energy consumption by 80%.
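
To illustrate why loop access order and data layout matter for shifting overhead, the following is a minimal toy sketch, not the paper's model or simulator. It assumes a single racetrack with one access port, where moving the port from domain `q` to domain `p` costs `|p - q|` shifts, and it counts the shifts needed to read operand `B` in the contraction `C[i][j] += A[i][k] * B[k][j]`. The names `shift_cost` and `b_accesses` are hypothetical and introduced only for this example.

```python
def shift_cost(addresses):
    """Total shifts on a single-port track (assumed model):
    moving the port from domain q to domain p costs |p - q|."""
    cost, pos = 0, 0  # the port is assumed to start at domain 0
    for a in addresses:
        cost += abs(a - pos)
        pos = a
    return cost


def b_accesses(n, loop_order, layout):
    """Sequence of track addresses of B[k][j] touched by
    C[i][j] += A[i][k] * B[k][j] for an n x n problem.

    loop_order: "ijk" (k innermost) or "ikj" (j innermost)
    layout:     "row" (B[k][j] at k*n + j) or "col" (B[k][j] at j*n + k)
    """
    if layout == "row":
        addr = lambda k, j: k * n + j
    else:
        addr = lambda k, j: j * n + k
    out = []
    for i in range(n):
        for x in range(n):
            for y in range(n):
                # x is the middle loop index, y the innermost one
                j, k = (x, y) if loop_order == "ijk" else (y, x)
                out.append(addr(k, j))
    return out


if __name__ == "__main__":
    n = 4
    for order in ("ijk", "ikj"):
        for layout in ("row", "col"):
            shifts = shift_cost(b_accesses(n, order, layout))
            print(order, layout, shifts)
```

In this toy model, a loop order whose innermost index walks consecutive track positions needs far fewer shifts than one that strides across the track, and changing the layout has the same effect as interchanging the loops. Real RTMs have multiple tracks and ports, so the numbers here are only indicative.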

Updated: 2020-07-07