Decreasing the Miss Rate and Eliminating the Performance Penalty of a Data Filter Cache
ACM Transactions on Architecture and Code Optimization (IF 1.5). Pub Date: 2021-05-10. DOI: 10.1145/3449043
Michael Stokes, David Whalley, Soner Onder

While data filter caches (DFCs) have been shown to be effective at reducing data access energy, they have not been adopted in processors due to the associated performance penalty caused by high DFC miss rates. In this article, we present a design that both decreases the DFC miss rate and completely eliminates the DFC performance penalty, even for a level-one data cache (L1 DC) with a single-cycle access time. First, we show that a DFC that lazily fills each word in a DFC line from the L1 DC only when the word is referenced is more energy-efficient than eagerly filling the entire DFC line. For a 512B DFC, we are able to eliminate loads of words into the DFC that are never referenced before being evicted, which occurred for about 75% of the words in 32B lines. Second, we demonstrate that a lazily word-filled DFC line can effectively share and pack data words from multiple L1 DC lines to lower the DFC miss rate. For a 512B DFC, we completely avoid accessing the L1 DC for loads about 23% of the time and avoid a fully associative L1 DC access for loads 50% of the time, while the DFC requires only about 2.5% of the size of the L1 DC. Finally, we present a method that completely eliminates the DFC performance penalty by speculatively performing DFC tag checks early and only accessing DFC data when a hit is guaranteed. For a 512B DFC, we reduce data access energy usage for the DTLB and L1 DC by 33% with no performance degradation.
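To make the lazy word-fill idea concrete, the following is a minimal behavioral sketch, not the authors' hardware design: a small direct-mapped filter cache keeps a per-word valid bit for each line and copies a word from the L1 DC only when that word is actually referenced. The constants (LINE_BYTES, WORD_BYTES, DFC_LINES) and the read_word_from_l1dc callback are illustrative assumptions, not parameters taken from the paper.

```python
# Sketch of a lazily word-filled data filter cache (DFC).
# Assumption: 32B lines, 4B words, 512B direct-mapped DFC.

LINE_BYTES = 32
WORD_BYTES = 4
WORDS_PER_LINE = LINE_BYTES // WORD_BYTES
DFC_LINES = 512 // LINE_BYTES   # 512B DFC -> 16 lines

class FilterCacheLine:
    def __init__(self):
        self.tag = None
        self.valid_words = [False] * WORDS_PER_LINE

class LazyFillDFC:
    """Direct-mapped DFC that fills one word per miss instead of a whole line."""
    def __init__(self):
        self.lines = [FilterCacheLine() for _ in range(DFC_LINES)]
        self.hits = 0
        self.word_fills = 0

    def load(self, addr, read_word_from_l1dc):
        word = (addr // WORD_BYTES) % WORDS_PER_LINE
        index = (addr // LINE_BYTES) % DFC_LINES
        tag = addr // (LINE_BYTES * DFC_LINES)
        line = self.lines[index]
        if line.tag == tag and line.valid_words[word]:
            self.hits += 1                       # DFC hit: L1 DC access avoided
            return
        if line.tag != tag:                      # line miss: reuse the frame,
            line.tag = tag                       # reset all per-word valid bits
            line.valid_words = [False] * WORDS_PER_LINE
        # Lazy fill: fetch only the referenced word from the L1 DC.
        read_word_from_l1dc(addr)
        line.valid_words[word] = True
        self.word_fills += 1
```

Running a load trace through this model and comparing word_fills against WORDS_PER_LINE fills per miss for an eager design illustrates, under these assumptions, why skipping never-referenced words saves energy; the sharing/packing of words from multiple L1 DC lines and the early speculative tag checks described in the paper are not modeled here.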

Updated: 2021-05-10