Fast Key-Value Lookups with Node Tracker, ACM Transactions on Architecture and Code Optimization


ACM Transactions on Architecture and Code Optimization (IF 1.6), Pub Date: 2021-06-08, DOI: 10.1145/3452099
Mustafa Cavus, Mohammed Shatnawi, Resit Sendag, Augustus K. Uht

Lookup operations for in-memory databases are heavily memory bound because they often rely on pointer-chasing traversals of linked data structures. They also contain many branches that are hard to predict because the lookup keys are random. In this study, we show that although cache misses are the primary bottleneck for these applications, prefetching alone achieves only a small fraction of the potential performance benefit unless branch mispredictions are also eliminated. We propose the Node Tracker (NT), a novel programmable prefetcher/pre-execution unit that is highly effective in exploiting inter key-lookup parallelism to improve single-thread performance. We extend NT with branch outcome streaming (BOS) to reduce branch mispredictions and show that this achieves an extra 3× speedup. We also evaluate the NT as a pre-execution unit and demonstrate that we can further improve the performance in both single- and multi-threaded execution modes. Our results show that, on average, NT improves single-thread performance by 4.1× when used as a prefetcher; 11.9× as a prefetcher with BOS; 14.9× as a pre-execution unit; and 18.8× as a pre-execution unit with BOS. Finally, with 24 cores of the latter version, we achieve speedups of 203× and 11× over the single-core and 24-core baselines, respectively.
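To make the workload described in the abstract concrete, the following is a minimal software sketch of a pointer-chasing key-value lookup and of the independence between lookups for different keys. It is not the Node Tracker, which the abstract describes as a hardware prefetcher/pre-execution unit. The binary search tree, the `Node` class, and the functions `insert`, `lookup`, and `lookup_batch` are all assumptions chosen for illustration; the abstract does not say which data structures the authors evaluate.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """One node of a hypothetical binary search tree used as the index."""
    key: int
    value: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root: Optional[Node], key: int, value: str) -> Node:
    """Insert (or overwrite) a key and return the tree root."""
    if root is None:
        return Node(key, value)
    node = root
    while True:
        if key == node.key:
            node.value = value
            return root
        if key < node.key:
            if node.left is None:
                node.left = Node(key, value)
                return root
            node = node.left
        else:
            if node.right is None:
                node.right = Node(key, value)
                return root
            node = node.right


def lookup(root: Optional[Node], key: int) -> Optional[str]:
    """Single lookup: each step depends on the node loaded by the previous
    step (pointer chasing), and the left/right choice is a data-dependent
    branch that is effectively random for random keys."""
    node = root
    while node is not None:
        if key == node.key:
            return node.value
        node = node.left if key < node.key else node.right
    return None


def lookup_batch(root: Optional[Node], keys: List[int]) -> List[Optional[str]]:
    """Advance several independent lookups one level at a time.

    Lookups for different keys do not depend on each other, so their node
    visits can be interleaved. This only illustrates that independence in
    software; it does not model prefetching or pre-execution hardware."""
    cursors: List[Optional[Node]] = [root] * len(keys)
    results: List[Optional[str]] = [None] * len(keys)
    active = list(range(len(keys)))
    while active:
        still_active = []
        for i in active:
            node = cursors[i]
            if node is None:
                continue  # key not present; result stays None
            if keys[i] == node.key:
                results[i] = node.value
                continue
            cursors[i] = node.left if keys[i] < node.key else node.right
            still_active.append(i)
        active = still_active
    return results
```

In `lookup`, the address of the next node is unknown until the current node has been loaded, which is why such traversals tend to be bound by memory latency. `lookup_batch` returns the same results as calling `lookup` once per key, which shows that the per-key traversals are independent of one another.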


Updated: 2021-06-08
