Energy-Efficient Hardware-Accelerated Synchronization for Shared-L1-Memory Multiprocessor Clusters

IEEE Transactions on Parallel and Distributed Systems (IF 5.3), Pub Date: 2021-03-01, DOI: 10.1109/tpds.2020.3028691
Florian Glaser, Giuseppe Tagliavini, Davide Rossi, Germain Haugou, Qiuting Huang, Luca Benini

The steeply growing performance demands of highly power- and energy-constrained processing systems, such as end nodes of the Internet of Things (IoT), have led to parallel near-threshold computing (NTC), which combines the energy-efficiency benefits of low-voltage operation with the performance typical of parallel systems. Shared-L1-memory multiprocessor clusters are a promising architecture, delivering performance on the order of GOPS and an energy efficiency of over 100 GOPS/W. However, this level of computational efficiency can only be reached by maximizing the effective utilization of the processing elements (PEs) available in the clusters. As part of this effort, optimizing PE-to-PE synchronization and communication is critical for performance. In this article, we describe a lightweight hardware-accelerated synchronization and communication unit (SCU) for tightly coupled clusters of processors. We detail the architecture, which enables fine-grained per-PE power management, and its integration into an eight-core cluster of RISC-V processors. To validate the effectiveness of the proposed solution, we implemented the eight-core cluster in advanced 22 nm FDX technology and evaluated performance and energy efficiency with tunable microbenchmarks and a set of real-life applications and kernels. When the microbenchmarks are constrained to 10 percent synchronization overhead, the proposed solution allows synchronization-free regions as small as 42 cycles, over 41× smaller than the baseline implementation based on fast test-and-set access to L1 memory. When evaluated on the real-life DSP applications, the proposed SCU improves performance by up to 92 percent (23 percent on average) and energy efficiency by up to 98 percent (39 percent on average).

Last updated: 2021-03-01
