An Open-Source Platform for High-Performance Non-Coherent On-Chip Communication
arXiv - CS - Hardware Architecture Pub Date : 2020-09-11 , DOI: arxiv-2009.05334
Andreas Kurth, Wolfgang Rönninger, Thomas Benz, Matheus Cavalcante, Fabian Schuiki, Florian Zaruba, Luca Benini

On-chip communication infrastructure is a central component of modern systems-on-chip (SoCs), and it continues to gain importance as the number of cores, the heterogeneity of components, and the on-chip and off-chip bandwidth continue to grow. Decades of research on on-chip networks enabled cache-coherent shared-memory multiprocessors. However, communication fabrics that meet the needs of heterogeneous many-cores and accelerator-rich SoCs, which are not, or only partially, coherent, are a much less mature research area. In this work, we present a modular, topology-agnostic, high-performance on-chip communication platform. The platform includes components to build and link subnetworks with customizable bandwidth and concurrency properties and adheres to a state-of-the-art, industry-standard protocol. We discuss microarchitectural trade-offs and timing/area characteristics of our modules and show that they can be composed to build high-bandwidth (e.g., 2.5 GHz and 1024 bit data width) end-to-end on-chip communication fabrics (not only network switches but also DMA engines and memory controllers) with high degrees of concurrency. We design and implement a state-of-the-art ML training accelerator, where our communication fabric scales to 1024 cores on a die, providing 32 TB/s cross-sectional bandwidth at only 24 ns round-trip latency between any two cores.
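To put the example figures from the abstract (2.5 GHz, 1024 bit data width) into perspective, the following minimal sketch computes the peak bandwidth of a single data channel, assuming one full-width transfer per clock cycle. The function names are hypothetical and are not part of the platform. The cycle conversion for the 24 ns round-trip latency assumes, purely for illustration, that the accelerator fabric runs at the same 2.5 GHz; the abstract does not state the accelerator's clock frequency.

```python
def peak_bandwidth_bytes_per_s(freq_hz: int, width_bits: int) -> int:
    """Peak bandwidth of one data channel, assuming one full-width
    transfer per clock cycle (an idealisation, not a measured result)."""
    return freq_hz * width_bits // 8


def latency_cycles(latency_ns: float, freq_hz: float) -> float:
    """Convert a latency in nanoseconds to clock cycles at freq_hz."""
    return latency_ns * freq_hz / 1e9


if __name__ == "__main__":
    # Example operating point quoted in the abstract.
    freq_hz = 2_500_000_000
    width_bits = 1024

    bw = peak_bandwidth_bytes_per_s(freq_hz, width_bits)
    print(f"peak per-channel bandwidth: {bw / 1e9:.0f} GB/s")

    # ASSUMPTION: the accelerator clock is not given in the abstract;
    # 2.5 GHz is reused here only to illustrate the conversion.
    print(f"24 ns at 2.5 GHz: {latency_cycles(24, freq_hz):.0f} cycles")
```

Under these assumptions a single 1024 bit channel peaks at 320 GB/s, so the quoted 32 TB/s cross-sectional bandwidth is an aggregate over many links rather than a property of one channel.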

Updated: 2020-09-14