Stream Semantic Registers: A Lightweight RISC-V ISA Extension Achieving Full Compute Utilization in Single-Issue Cores
arXiv - CS - Hardware Architecture Pub Date : 2019-11-19 , DOI: arxiv-1911.08356
Fabian Schuiki, Florian Zaruba, Torsten Hoefler, Luca Benini

Single-issue processor cores are very energy efficient but suffer from the von Neumann bottleneck, in that they must explicitly fetch and issue the loads/stores necessary to feed their ALU/FPU. Each instruction spent on moving data is a cycle not spent on computation, limiting ALU/FPU utilization to 33% on reductions. We propose "Stream Semantic Registers" (SSR) to boost utilization and increase energy efficiency. SSR is a lightweight, non-invasive RISC-V ISA extension which implicitly encodes memory accesses as register reads/writes, eliminating a large number of loads/stores. We implement the proposed extension in the RTL of an existing multi-core cluster and synthesize the design for a modern 22nm technology. Our extension provides a significant architectural speedup of 2x to 5x across different kernels at a small 11% increase in core area. Sequential code runs 3x faster on a single core, and 3x fewer cores are needed in a cluster to achieve the same performance. The increase in utilization to almost 100% leads to a 2x energy efficiency improvement in a multi-core cluster. The extension reduces instruction fetches by up to 3.5x and instruction cache power consumption by up to 5.6x. Compilers can automatically map loop nests to SSRs, making the changes transparent to the programmer.
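
The 33% figure can be illustrated with a simple instruction-count model. The sketch below is not taken from the paper's evaluation. It assumes a reduction such as a dot product in which each element costs two explicit loads and one compute instruction, that the core issues one instruction per cycle, and that loop overhead (branches, pointer updates) and stream setup are ignored. The function name `fpu_utilization` and its parameters are hypothetical.

```python
def fpu_utilization(n_elements, loads_per_element, compute_per_element=1):
    """Fraction of issued instructions that perform computation.

    Assumes a single-issue core (one instruction per cycle) and ignores
    loop overhead and any stream setup cost; both are simplifications.
    """
    compute = n_elements * compute_per_element
    memory = n_elements * loads_per_element
    return compute / (compute + memory)


# Baseline: every compute instruction needs two explicit loads.
baseline = fpu_utilization(1000, loads_per_element=2)

# With stream registers (idealized): operands arrive through register
# reads, so no explicit load instructions are issued in the loop body.
with_ssr = fpu_utilization(1000, loads_per_element=0)

# Ratio of issued instructions per element: 3 without streams, 1 with.
ideal_speedup = with_ssr / baseline

print(f"baseline utilization: {baseline:.3f}")
print(f"idealized SSR utilization: {with_ssr:.3f}")
print(f"idealized speedup: {ideal_speedup:.1f}x")
```

Under these assumptions the idealized speedup is 3x, which is consistent with the single-core figure quoted in the abstract. Real kernels also pay for loop control and stream configuration, so measured values will differ.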

Updated: 2020-04-02