Micro BTB: A High Performance and Lightweight Last-Level Branch Target Buffer for Servers
arXiv - CS - Hardware Architecture. Pub Date: 2021-06-08, DOI: arxiv-2106.04205
Vishal Gupta (Indian Institute of Technology, Kanpur), Biswabandan Panda (Indian Institute of Technology, Bombay)

High-performance branch target buffers (BTBs) and the L1I cache are key to a high-performance front-end. Modern branch predictors are highly accurate, but as code footprints grow in modern server workloads, BTB and L1I misses remain frequent. Recent industry trends show the use of large BTBs (hundreds of KB per core) that deliver performance close to that of an ideal BTB, along with a decoupled front-end that provides efficient fetch-directed L1I instruction prefetching. In contrast, techniques proposed by academia, such as BTB prefetching and learning from the retire-order stream, fail to provide significant performance gains on modern processor cores, which are deeper and wider. We solve the problem fundamentally by increasing the storage density of the last-level BTB. We observe that not all branch instructions require a full branch target address. Instead, we can store the branch target as a branch offset relative to the branch instruction. Using branch offsets enables the BTB to store multiple branches per entry. We reduce the BTB storage by half, but we observe that this increases skewness in the BTB. We propose a skewed-indexed and compressed last-level BTB design called MicroBTB (MBTB) that stores multiple branches per BTB entry. We evaluate MBTB on 100 industry-provided server workloads. A 4K-entry MBTB provides a 17.61% performance improvement over an 8K-entry baseline BTB design, with a storage savings of 47.5KB per core.
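
The offset idea in the abstract can be illustrated with a minimal sketch. It shows only the general technique of storing a signed branch-relative offset in a narrow field and recovering the full target from it. The function names, the example addresses, and the 16-bit field width are assumptions chosen for illustration; the abstract does not specify MBTB's actual entry layout or field sizes.

```python
def offset_bits(branch_pc: int, target: int) -> int:
    """Bits needed to hold (target - branch_pc) as a two's-complement value."""
    offset = target - branch_pc
    # A signed n-bit field covers the range [-2**(n-1), 2**(n-1) - 1].
    if offset >= 0:
        return offset.bit_length() + 1
    return (-offset - 1).bit_length() + 1


def encode_offset(branch_pc: int, target: int, field_bits: int):
    """Return the offset packed into field_bits bits, or None if it does not fit.

    field_bits is a hypothetical parameter, not a width taken from the paper.
    """
    if offset_bits(branch_pc, target) > field_bits:
        return None  # this branch would need a full target address instead
    return (target - branch_pc) & ((1 << field_bits) - 1)


def decode_offset(branch_pc: int, field: int, field_bits: int) -> int:
    """Sign-extend the stored field and add it back to the branch address."""
    sign_bit = 1 << (field_bits - 1)
    offset = (field ^ sign_bit) - sign_bit
    return branch_pc + offset


if __name__ == "__main__":
    pc = 0x401000                      # made-up branch address
    near_fwd, near_back = 0x401040, 0x400FC0
    far = 0x7F0000001000               # made-up distant target

    for tgt in (near_fwd, near_back, far):
        field = encode_offset(pc, tgt, 16)
        status = "fits in 16 bits" if field is not None else "needs a full address"
        print(hex(tgt), offset_bits(pc, tgt), "bits,", status)
```

Under these assumptions, nearby targets need only a handful of bits while a distant target still requires a wide field, which is the property that would let several short-offset branches share the space of one full-address entry.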

Updated: 2021-06-09