Efficient local locking for massively multithreaded in-memory hash-based operators
The VLDB Journal (IF 2.8) Pub Date: 2021-02-11, DOI: 10.1007/s00778-020-00642-5
Bashar Romanous, Skyler Windh, Ildar Absalyamov, Prerna Budhkar, Robert Halstead, Walid Najjar, Vassilis Tsotras

The join and group-by aggregation are two memory-intensive operators that affect the performance of relational databases. Hashing is a common approach used to implement both operators. Recent paradigm shifts in multi-core processor architectures have reinvigorated research into how the join and group-by aggregation operators can leverage these advances. However, the poor spatial locality of the hashing approach has hindered performance on multi-core processor architectures, which rely on large cache hierarchies for latency mitigation. Multithreaded architectures can better cope with poor spatial locality by masking memory latency with many outstanding requests. Nevertheless, the number of parallel threads, even in the most advanced multithreaded processors, such as the UltraSPARC, is not enough to fully cover the main memory access latency. In this paper, we explore the hardware re-configurability of FPGAs to enable deeper execution pipelines that maintain hundreds (instead of tens) of outstanding memory requests across four FPGAs, drastically increasing concurrency and throughput. We present two end-to-end in-memory accelerators for the join and group-by aggregation operators using FPGAs. Both accelerators use massive multithreading to mask the long memory delays of traversing linked-list data structures while concurrently managing hundreds of thread states across four FPGAs locally. We explore how content-addressable memories (CAMs) can be intermixed within our multithreaded designs to act as a synchronizing cache, which enforces locks and merges jobs together before they are written to memory. Throughput results for our hash-join operator accelerator show a speedup between 2× and 3.4× over the best multi-core approaches with comparable memory bandwidths on uniform and skewed datasets. The accelerator for the hash-based group-by aggregation operator demonstrates that leveraging CAMs achieves an average speedup of 3.3× (best case 9.4×) in throughput over CPU implementations across five types of data distributions.
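To make the locking-and-merging idea concrete, here is a minimal software sketch of a CAM-style synchronizing cache. It is not from the paper, which implements this logic directly in FPGA hardware; all names (SyncCache, Job, acquire_or_merge, release) are illustrative, and SUM stands in for an arbitrary aggregate. The idea: in-flight updates are keyed by hash-bucket address, and a new job that matches an in-flight entry is merged locally instead of issuing a second, conflicting memory write.

```cpp
// Hypothetical software model of a CAM-based "synchronizing cache":
// one in-flight entry per hash-bucket address acts as a local lock,
// and colliding jobs are merged before anything is written to memory.
#include <cstdint>
#include <iostream>
#include <optional>
#include <unordered_map>

struct Job {
    uint64_t key;      // group-by key
    int64_t  partial;  // partial aggregate (SUM in this sketch)
};

class SyncCache {
    // Models the CAM: at most one in-flight update per bucket address.
    std::unordered_map<uint64_t, Job> inflight_;
public:
    // Returns true if the job acquired the "lock" and must be issued to
    // memory; false if it was merged with an in-flight update to the
    // same bucket, so no new memory request is needed.
    bool acquire_or_merge(uint64_t bucket_addr, const Job& job) {
        auto it = inflight_.find(bucket_addr);
        if (it != inflight_.end()) {           // hit: bucket is locked
            it->second.partial += job.partial; // merge instead of stalling
            return false;
        }
        inflight_.emplace(bucket_addr, job);   // miss: take the lock
        return true;
    }

    // Called when the memory write for a bucket completes; releases the
    // lock and returns the merged value to be written back.
    std::optional<Job> release(uint64_t bucket_addr) {
        auto it = inflight_.find(bucket_addr);
        if (it == inflight_.end()) return std::nullopt;
        Job merged = it->second;
        inflight_.erase(it);
        return merged;
    }
};

int main() {
    SyncCache cache;
    uint64_t bucket = 0x42;
    // First job locks the bucket and is issued to memory.
    std::cout << cache.acquire_or_merge(bucket, {7, 10}) << '\n'; // 1
    // A second job for the same bucket merges locally: 10 + 5 = 15.
    std::cout << cache.acquire_or_merge(bucket, {7, 5}) << '\n';  // 0
    if (auto j = cache.release(bucket))
        std::cout << "write back partial = " << j->partial << '\n'; // 15
}
```

Merging on a CAM hit is what lets the pipeline keep hundreds of requests outstanding: a thread that would otherwise stall on a locked bucket instead folds its work into the pending update and retires immediately.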




Updated: 2021-02-11