Identifying Power-Efficient Multicore Cache Hierarchies via Reuse Distance Analysis
ACM Transactions on Computer Systems (IF 1.5), Pub Date: 2016-04-07, DOI: 10.1145/2851503
Michael Badamo, Jeff Casarona, Minshu Zhao, Donald Yeung

To enable performance improvements in a power-efficient manner, computer architects have been building CPUs that exploit greater amounts of thread-level parallelism. A key consideration in such CPUs is properly designing the on-chip cache hierarchy. Unfortunately, this can be hard to do, especially for CPUs with high core counts and large amounts of cache. The enormous design space formed by the combinatorial number of ways in which to organize the cache hierarchy makes it difficult to identify power-efficient configurations. Moreover, the problem is exacerbated by the slow speed of architectural simulation, which is the primary means for conducting such design space studies. A powerful tool that can help architects optimize CPU cache hierarchies is reuse distance (RD) analysis. Recent work has extended uniprocessor RD techniques (i.e., by introducing concurrent RD and private-stack RD profiling) to enable analysis of different types of caches in multicore CPUs. Once acquired, parallel locality profiles can predict the performance of numerous cache configurations, permitting highly efficient design space exploration. To date, existing work on multicore RD analysis has focused on developing the profiling techniques and assessing their accuracy. Unfortunately, there has been no work on using RD analysis to optimize CPU performance or power consumption. This article investigates applying multicore RD analysis to identify the most power-efficient cache configurations for a multicore CPU. First, we develop analytical models that use the cache-miss counts from parallel locality profiles to estimate CPU performance and power consumption. Although future scalable CPUs will likely employ multithreaded (and even out-of-order) cores, our current study assumes single-threaded in-order cores to simplify the models, allowing us to focus on the cache hierarchy and our RD-based techniques. Second, to demonstrate the utility of our techniques, we apply our models to optimize a large-scale tiled CPU architecture with a two-level cache hierarchy. We show that the most power-efficient configuration varies considerably across different benchmarks, and that our locality profiles provide deep insights into why certain configurations are power efficient. We also show that picking the best configuration can provide significant gains, as there is a 2.01x power efficiency spread across our tiled CPU design space. Finally, we validate the accuracy of our techniques using detailed simulation. Among several simulated configurations, our techniques can usually pick the most power-efficient configuration, or one that is very close to the best. In addition, across all simulated configurations, we can predict power efficiency with 15.2% error.
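
As a rough illustration of why a single locality profile can cover many cache sizes, the sketch below computes a classic uniprocessor reuse distance histogram and reads miss counts off it. It is not the concurrent RD or private-stack RD profiling described in the abstract, and it is not the authors' analytical model. It assumes a fully associative LRU cache, and the names `reuse_distance_histogram` and `predicted_misses` are hypothetical.

```python
from collections import Counter


def reuse_distance_histogram(trace):
    """Map each reuse distance to its access count; cold accesses use key None.

    The reuse distance of an access is the number of distinct blocks
    referenced since the previous access to the same block.
    """
    stack = []  # LRU stack, most recently used block at the end
    hist = Counter()
    for block in trace:
        if block in stack:
            idx = stack.index(block)
            # Blocks above `block` in the stack were touched since its last use.
            hist[len(stack) - 1 - idx] += 1
            stack.pop(idx)
        else:
            hist[None] += 1  # first touch: a cold miss at any capacity
        stack.append(block)
    return hist


def predicted_misses(hist, capacity_blocks):
    """Misses in a fully associative LRU cache holding `capacity_blocks` blocks.

    Assumption: an access hits exactly when its reuse distance is
    smaller than the capacity.
    """
    return sum(
        count
        for dist, count in hist.items()
        if dist is None or dist >= capacity_blocks
    )


# Hypothetical trace of block addresses, profiled once.
trace = ["a", "b", "c", "a", "b", "d", "a"]
hist = reuse_distance_histogram(trace)
for capacity in (1, 2, 3, 4):
    print(capacity, predicted_misses(hist, capacity))
```

The trace is profiled once and the histogram is then queried for each capacity, which is the property that makes RD-based design space exploration cheap compared with simulating every configuration separately. The list-based stack costs time proportional to the number of distinct blocks per access; it is chosen here for readability, not speed.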

Updated: 2016-04-07