Accelerating Progressive Set Similarity Join with the CPU-GPU Architecture
Big Data Research (IF 3.3), Pub Date: 2021-08-30, DOI: 10.1016/j.bdr.2021.100267
Lining Yu 1 , Tiezheng Nie 1 , Derong Shen 1 , Yue Kou 1

Set similarity join (SSJoin) is an important operation for finding similar set pairs in a given database, and it plays a core role in data integration, data cleaning, and data mining. Unlike traditional SSJoin methods, progressive SSJoin targets large datasets and aims to find as many similar pairs as possible within a limited running time by reporting partial matching pairs as early as possible during processing. Moreover, many recent studies have shown that GPUs (Graphics Processing Units) can accelerate similarity join operations. This paper explores progressive SSJoin algorithms and accelerates them with the CPU-GPU architecture. We propose two progressive SSJoin methods, PSSJM and PBM. PSSJM uses inverted indexing, while PBM relies on a counting Bloom filter and prefix filtering. In addition, we propose a GPU-based algorithm built on our progressive SSJoin method to accelerate processing. Comprehensive experiments on real-world datasets show that our methods produce better-quality results than the traditional method under a limited time budget, and that the implementation on the CPU-GPU architecture achieves high speedups over the CPU-based method.
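
For readers unfamiliar with two of the building blocks named in the abstract, inverted indexing and prefix filtering, the following is a minimal Python sketch of a conventional (non-progressive, CPU-only) set similarity join under a Jaccard threshold. It is not the authors' PSSJM or PBM, and it omits the counting Bloom filter, the progressive ordering of results, and the GPU stage. The function names `jaccard` and `prefix_filter_join`, the choice of Jaccard similarity, and the sample data are assumptions made for illustration only.

```python
import math
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def prefix_filter_join(records, threshold):
    """Return sorted (i, j, similarity) triples with Jaccard >= threshold.

    Illustrative sketch only: a textbook prefix-filter join, not the
    method proposed in the paper. Empty records are skipped.
    """
    # Count in how many records each token occurs.
    freq = defaultdict(int)
    for rec in records:
        for tok in set(rec):
            freq[tok] += 1

    # Sort every record by ascending token frequency (ties broken by token),
    # so that the prefix of each record holds its rarest tokens.
    ordered = [sorted(set(rec), key=lambda t: (freq[t], t)) for rec in records]

    index = defaultdict(list)  # token -> ids of records with token in prefix
    results = []
    for rid, toks in enumerate(ordered):
        if not toks:
            continue
        # Prefix length for Jaccard threshold t: |x| - ceil(t * |x|) + 1.
        # The small epsilon guards against floating-point round-up.
        min_overlap = math.ceil(threshold * len(toks) - 1e-9)
        prefix_len = len(toks) - min_overlap + 1

        # Probe the inverted index with prefix tokens, then index them.
        candidates = set()
        for tok in toks[:prefix_len]:
            candidates.update(index[tok])
            index[tok].append(rid)

        # Verify each candidate pair with the exact similarity.
        for cid in candidates:
            sim = jaccard(set(toks), set(ordered[cid]))
            if sim >= threshold:
                results.append((cid, rid, sim))
    return sorted(results)


if __name__ == "__main__":
    sample = [
        {"a", "b", "c", "d"},
        {"a", "b", "c", "e"},
        {"x", "y", "z"},
        {"a", "b", "c", "d", "e"},
    ]
    for i, j, sim in prefix_filter_join(sample, 0.6):
        print(i, j, round(sim, 2))
```

The sketch only verifies pairs whose prefixes share a token, which is what lets the inverted index stay small. A progressive variant would additionally decide which candidates to verify first so that likely matches are reported early; how the paper does this is not described in the abstract.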




Updated: 2021-09-06