Fast Key-Value Lookups with Node Tracker
ACM Transactions on Architecture and Code Optimization (IF 1.6), Pub Date: 2021-06-08, DOI: 10.1145/3452099
Mustafa Cavus, Mohammed Shatnawi, Resit Sendag, Augustus K. Uht

Lookup operations for in-memory databases are heavily memory bound because they often rely on pointer-chasing traversals of linked data structures. They also have many branches that are hard to predict due to random key lookups. In this study, we show that although cache misses are the primary bottleneck for these applications, prefetching alone achieves only a small fraction of the possible performance benefit unless branch mispredictions are also eliminated. We propose the Node Tracker (NT), a novel programmable prefetcher/pre-execution unit that is highly effective in exploiting inter-key-lookup parallelism to improve single-thread performance. We extend NT with branch outcome streaming (BOS) to reduce branch mispredictions and show that this achieves an extra 3× speedup. We then evaluate the NT as a pre-execution unit and demonstrate that we can further improve the performance in both single- and multi-threaded execution modes. Our results show that, on average, NT improves single-thread performance by 4.1× when used as a prefetcher; 11.9× as a prefetcher with BOS; 14.9× as a pre-execution unit; and 18.8× as a pre-execution unit with BOS. Finally, with 24 cores of the latter version, we achieve speedups of 203× and 11× over the single-core and 24-core baselines, respectively.
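The reported averages can be related to one another with simple arithmetic. The sketch below uses only the figures quoted in the abstract; the variable names are illustrative, and because the inputs are averages, the derived ratios are rough consistency checks rather than results stated by the authors.

```python
# Average single-thread speedups over the single-core baseline,
# copied from the abstract. Names are illustrative only.
prefetcher = 4.1
prefetcher_bos = 11.9
pre_execution = 14.9
pre_execution_bos = 18.8

# 24-core results for the pre-execution unit with BOS.
nt24_vs_1core_baseline = 203.0
nt24_vs_24core_baseline = 11.0

# Extra gain from adding BOS to each mode (ratio of averages, approximate).
bos_gain_prefetcher = prefetcher_bos / prefetcher
bos_gain_pre_execution = pre_execution_bos / pre_execution

# Implied speedup of the 24-core baseline over the single-core baseline.
baseline_scaling_24 = nt24_vs_1core_baseline / nt24_vs_24core_baseline

# Implied scaling of NT (pre-execution with BOS) from 1 core to 24 cores.
nt_scaling_24 = nt24_vs_1core_baseline / pre_execution_bos

print(f"BOS gain, prefetcher mode:     {bos_gain_prefetcher:.1f}x")
print(f"BOS gain, pre-execution mode:  {bos_gain_pre_execution:.1f}x")
print(f"24-core baseline vs 1 core:    {baseline_scaling_24:.1f}x")
print(f"NT scaling from 1 to 24 cores: {nt_scaling_24:.1f}x")
```

The first ratio (about 2.9) is consistent with the "extra 3×" attributed to BOS in prefetcher mode. The last two ratios suggest that the 24-core baseline runs roughly 18× faster than the single-core baseline, while NT with BOS gains roughly 11× from the same increase in core count, under the assumption that ratios of the reported averages are meaningful.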

Updated: 2021-06-08