Manticore: A 4096-Core RISC-V Chiplet Architecture for Ultraefficient Floating-Point Computing

IEEE Micro (IF 2.8), Pub Date: 2020-12-17, DOI: 10.1109/mm.2020.3045564
Florian Zaruba, Fabian Schuiki, Luca Benini

Data-parallel problems demand an ever-growing number of floating-point (FP) operations per second under tight area- and energy-efficiency constraints. In this work, we present Manticore, a general-purpose, ultraefficient chiplet-based architecture for data-parallel FP workloads. We have manufactured a prototype of the chiplet's computational core in the GlobalFoundries 22FDX process and demonstrate a more than 5× improvement in energy efficiency on FP-intensive workloads compared to CPUs and GPUs. The compute capability at high energy and area efficiency is provided by Snitch clusters ("Snitch: A tiny pseudo dual-issue processor for area and energy efficient execution of floating-point intensive workloads," IEEE Trans. Comput.), each containing eight small integer cores, and each core controls a large floating-point unit (FPU).

The core supports two custom ISA extensions. The stream semantic register (SSR) extension elides explicit load and store instructions by encoding them as register reads and writes ("Stream semantic registers: A lightweight RISC-V ISA extension achieving full compute utilization in single-issue cores," IEEE Trans. Comput.). The floating-point repetition (FREP) extension decouples the integer core from the FPU, allowing floating-point instructions to be issued independently. Together, these two extensions allow the single-issue core to minimize its instruction fetch bandwidth and saturate the instruction bandwidth of the FPU, achieving FPU utilization above 90%, with more than 40% of the core area dedicated to the FPU.
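To make the effect of the two extensions more concrete, here is a toy Python model. It is not the Snitch hardware and does not use the real instruction encoding. The model assumes a simplified one-dimensional affine stream (`AffineStream`, a hypothetical name) bound to an FP register, and it counts only the instructions the core must fetch for a dot product. Loop control and stream configuration are ignored in both versions, so the counts only illustrate the scaling. The explicit-load loop fetches a fixed number of instructions per element, whereas with streamed operands and a repetition instruction, the FMA is fetched once and replayed by the FPU.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AffineStream:
    """Hypothetical 1-D stream: yields mem[base + i * stride] for i < count."""
    mem: List[float]
    base: int
    stride: int
    count: int
    _i: int = 0

    def read(self) -> float:
        # Reading the stream-mapped register pops the next element. The stream
        # generates the address itself, so no explicit load is fetched.
        if self._i >= self.count:
            raise RuntimeError("stream exhausted")
        value = self.mem[self.base + self._i * self.stride]
        self._i += 1
        return value


def dot_baseline(mem: List[float], a: int, b: int, n: int) -> Tuple[float, int]:
    """Explicit loads: two loads and one FMA fetched per element."""
    acc, fetched = 0.0, 0
    for i in range(n):
        x = mem[a + i]
        y = mem[b + i]
        acc += x * y
        fetched += 3  # load, load, FMA (loop control not counted)
    return acc, fetched


def dot_ssr_frep(mem: List[float], a: int, b: int, n: int) -> Tuple[float, int]:
    """Toy SSR + repetition model: operands arrive through two streams."""
    ft0 = AffineStream(mem, a, 1, n)  # stands in for a stream-mapped FP register
    ft1 = AffineStream(mem, b, 1, n)
    fetched = 2  # one repetition instruction + one FMA body, each fetched once
    acc = 0.0
    for _ in range(n):  # the FPU replays the same FMA n times
        acc += ft0.read() * ft1.read()
    return acc, fetched


mem = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
print(dot_baseline(mem, 0, 4, 4))  # (70.0, 12)
print(dot_ssr_frep(mem, 0, 4, 4))  # (70.0, 2)
```

Both functions compute the same result. In this model, the baseline's fetch count grows as 3n, while the streamed version stays constant. This is the intuition behind the abstract's claim that the single-issue core can keep the FPU busy while fetching very few instructions.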