An Empirical-cum-Statistical Approach to Power-Performance Characterization of Concurrent GPU Kernels
arXiv - CS - Graphics, Pub Date: 2020-11-04, DOI: arxiv-2011.02368
Nilanjan Goswami, Amer Qouneh, Chao Li, Tao Li

The growing deployment of power- and energy-efficient throughput accelerators (GPUs) in data centers demands stronger power-performance co-optimization capabilities in GPUs. Realizing exascale computing with accelerators requires further improvements in power efficiency. With hardwired support for kernel concurrency in accelerators, simultaneous execution of kernels across and within workloads promises increased throughput at a lower energy budget. To improve the performance-per-watt of these architectures, a systematic empirical study of real-world throughput workloads (with concurrent kernel execution) is required. To this end, we propose a multi-kernel throughput workload generation framework that will facilitate aggressive energy and performance management of exascale data centers and will stimulate synergistic power-performance co-optimization of throughput architectures. We also demonstrate a multi-kernel throughput benchmark suite, built on the framework, that encapsulates symmetric, asymmetric, and co-existing (kernels that often appear together) workloads. Our analysis reveals that spatial and temporal concurrency within kernel execution in throughput architectures reduces energy consumption by 32%, 26%, and 33% on average on the GTX470, Tesla M2050, and Tesla K20, respectively, across 12 benchmarks. Concurrency and enhanced utilization are often correlated but do not imply a significant deviation in power dissipation. Diversity analysis of the proposed multi-kernels confirms characteristic variation and power-profile diversity within the suite. In addition, we explain several findings regarding power-performance co-optimization of concurrent throughput workloads.
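The relationship between power, runtime, and energy behind the reported savings can be illustrated with a minimal sketch. It assumes energy is simply average power multiplied by runtime; this reading is an assumption, not something the abstract states. The function names and all numeric values are hypothetical and are not measurements from the paper.

```python
def energy_joules(avg_power_watts: float, runtime_seconds: float) -> float:
    """Energy under the simplifying assumption E = P_avg * t."""
    return avg_power_watts * runtime_seconds


def energy_saving_fraction(serial_energy: float, concurrent_energy: float) -> float:
    """Fraction of energy saved by the concurrent run relative to the serial run."""
    return 1.0 - concurrent_energy / serial_energy


def perf_per_watt(work_items: float, energy: float) -> float:
    """Throughput per watt, (work / t) / P, which simplifies to work / energy."""
    return work_items / energy


# Hypothetical values: concurrency shortens runtime while average power
# changes only slightly. These are illustrative, not data from the paper.
serial = energy_joules(avg_power_watts=150.0, runtime_seconds=10.0)
concurrent = energy_joules(avg_power_watts=155.0, runtime_seconds=7.0)
saving = energy_saving_fraction(serial, concurrent)

print(f"serial energy:     {serial:.0f} J")
print(f"concurrent energy: {concurrent:.0f} J")
print(f"energy saving:     {saving:.1%}")
print(f"perf/W gain:       {perf_per_watt(1e6, concurrent) / perf_per_watt(1e6, serial):.2f}x")
```

Under these assumed numbers, a small rise in average power is outweighed by the shorter runtime, so total energy falls even though power dissipation barely changes.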

Updated: 2020-11-06