Reliability-aware Garbage Collection for Hybrid HBM-DRAM Memories, ACM Transactions on Architecture and Code Optimization


ACM Transactions on Architecture and Code Optimization (IF 1.6), Pub Date: 2021-01-20, DOI: 10.1145/3431803
Wenjie Liu₁, Shoaib Akram₂, Jennifer B. Sartor₁, Lieven Eeckhout₁


Emerging workloads in cloud and data center infrastructures demand high main memory bandwidth and capacity. Unfortunately, DRAM alone is unable to satisfy contemporary main memory demands. High-bandwidth memory (HBM) uses 3D die-stacking to deliver 4–8× higher bandwidth. HBM has two drawbacks: (1) capacity is low, and (2) soft error rate is high. Hybrid memory combines DRAM and HBM to promise low fault rates, high bandwidth, and high capacity. Prior OS approaches manage HBM by mapping pages to HBM versus DRAM based on hotness (access frequency) and risk (susceptibility to soft errors). Unfortunately, these approaches operate at a coarse-grained page granularity, and frequent page migrations hurt performance. This article proposes a new class of reliability-aware garbage collectors for hybrid HBM-DRAM systems that place hot and low-risk objects in HBM and the rest in DRAM. Our analysis of nine real-world Java workloads shows that: (1) newly allocated objects in the nursery are frequently written, making them both hot and low-risk, (2) a small fraction of the mature objects are hot and low-risk, and (3) allocation site is a good predictor for hotness and risk. We propose RiskRelief, a novel reliability-aware garbage collector that uses allocation site prediction to place hot and low-risk objects in HBM. Allocation sites are profiled offline, and RiskRelief uses heuristics to classify each allocation site as either DRAM or HBM. The proposed heuristics expose Pareto-optimal trade-offs between soft error rate (SER) and execution time. RiskRelief improves SER by 9× compared to an HBM-Only system while at the same time improving performance by 29% compared to a DRAM-Only system. Compared to a state-of-the-art OS approach for reliability-aware data placement, RiskRelief eliminates all page migration overheads, which substantially improves performance while delivering similar SER. Reliability-aware garbage collection opens up a new opportunity to manage emerging HBM-DRAM memories at fine granularity while requiring no extra hardware support and leaving the programming model unchanged.
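The abstract does not spell out RiskRelief's heuristics, so the following is only a minimal sketch of what allocation-site classification could look like under simple assumptions. The `SiteProfile` fields, the two thresholds, and the site names are all hypothetical and do not come from the paper; the sketch only illustrates the stated idea that sites producing hot, low-risk objects go to HBM and everything else goes to DRAM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteProfile:
    """Offline profile for one allocation site (all fields are hypothetical)."""
    site: str        # allocation-site identifier
    hotness: float   # assumed metric, e.g. accesses per allocated byte
    risk: float      # assumed metric, normalized soft-error risk in [0, 1]


def classify_sites(profiles, min_hotness, max_risk):
    """Map each site to "HBM" if it is hot AND low-risk, otherwise "DRAM"."""
    placement = {}
    for p in profiles:
        hot = p.hotness >= min_hotness
        low_risk = p.risk <= max_risk
        placement[p.site] = "HBM" if (hot and low_risk) else "DRAM"
    return placement


# Made-up profiles: one hot/low-risk site, one hot/high-risk, one cold.
profiles = [
    SiteProfile("Parser.newToken", hotness=12.0, risk=0.1),
    SiteProfile("Cache.newEntry", hotness=9.0, risk=0.8),
    SiteProfile("Config.load", hotness=0.2, risk=0.1),
]
placement = classify_sites(profiles, min_hotness=5.0, max_risk=0.3)
```

In this sketch, lowering `max_risk` or raising `min_hotness` sends more sites to DRAM. Sweeping such thresholds is one plausible way to explore the SER versus execution-time trade-off the abstract mentions, though the values used here are purely illustrative.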


Updated: 2021-01-20
