Elsevier

Information Systems

Volume 89, March 2020, 101485

An empirical evaluation of exact set similarity join techniques using GPUs

https://doi.org/10.1016/j.is.2019.101485

Highlights

  • A thorough evaluation showing the sweet spot of each technique for exact set similarity joins using a GPU.

  • At large threshold values, the sequential CPU techniques are competitive.

  • At lower threshold values, employing parallel GPU techniques seems beneficial.

  • Overall, GPU techniques may perform worse due to the quadratic space overhead they impose.

  • A CPU-GPU co-processing scheme performs better in some cases due to efficient workload balancing.

Abstract

Exact set similarity join is a notoriously expensive operation, for which several solutions have been proposed. Recently, studies have presented comparative analyses in MapReduce or non-parallel settings. We complement these works by conducting a thorough evaluation of the state-of-the-art GPU-enabled techniques. These techniques differ widely in their key features, and our experiments reveal the key strengths of each one. As we explain, in real-life applications there is no dominant solution. Depending on the specific dataset and query characteristics, each solution, even one that does not use the GPU at all, has its own sweet spot. All our work is repeatable and extensible.

Introduction

Given two collections of sets and a threshold, set similarity join is the operation of computing all pairs whose overlap exceeds the given threshold. Similarity joins are used in a range of applications, such as plagiarism detection, web crawling, clustering and data mining, and have recently been the subject of extensive research, e.g., [1], [2], [3], [4], [5].
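
To make the definition concrete, the following minimal sketch (our illustration in CUDA-compatible C++, not code from any of the evaluated systems) computes all pairs of sets from two collections whose Jaccard similarity reaches a given threshold by exhaustive pairwise comparison; the quadratic cost of this naive nested loop is precisely what the techniques surveyed in this paper try to mitigate.

    // Naive exact set similarity join: compare every pair of sets (quadratic).
    // Sets are assumed to be sorted vectors of integer token ids.
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    using Set = std::vector<int>;   // sorted token ids

    // |r ∩ s| for two sorted sets, computed by a linear merge.
    static std::size_t overlap(const Set& r, const Set& s) {
        std::size_t i = 0, j = 0, common = 0;
        while (i < r.size() && j < s.size()) {
            if (r[i] == s[j])      { ++common; ++i; ++j; }
            else if (r[i] < s[j])  { ++i; }
            else                   { ++j; }
        }
        return common;
    }

    // Jaccard(r, s) = |r ∩ s| / |r ∪ s|, with |r ∪ s| = |r| + |s| - |r ∩ s|.
    static bool jaccardAtLeast(const Set& r, const Set& s, double t) {
        const double inter = static_cast<double>(overlap(r, s));
        const double uni   = static_cast<double>(r.size() + s.size()) - inter;
        return uni > 0.0 && inter / uni >= t;
    }

    int main() {
        std::vector<Set> R = {{1, 2, 3, 4}, {2, 3, 5}};
        std::vector<Set> S = {{1, 2, 3, 5}, {6, 7}};
        const double threshold = 0.5;
        for (std::size_t i = 0; i < R.size(); ++i)      // all |R| * |S| pairs
            for (std::size_t j = 0; j < S.size(); ++j)
                if (jaccardAtLeast(R[i], S[j], threshold))
                    std::printf("similar pair: (R[%zu], S[%zu])\n", i, j);
        return 0;
    }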

In very large datasets, finding similar sets is not trivial. Due to the inherent quadratic complexity, a set similarity join even between medium-sized datasets can take hours to complete on a single machine. For example, in the same setting used in our experiments presented later, a similarity join over the complete DBLP dataset using a Jaccard threshold of 0.85 takes approximately 8.5 hours when employing only a modern CPU. In addition, challenges such as high dimensionality, sparsity, unknown data distributions and expensive evaluation arise. To tackle these scalability challenges, two main and complementary approaches have been followed. The first is to devise sophisticated filtering techniques that safely prune pairs which cannot meet the threshold as early as possible, typically through simple computations on the prefix and the suffix of the ordered sets, e.g., [1], [4]. The second is to exploit the massive parallelism offered by parallel paradigms such as MapReduce [5], [6], [7], [8] and GPGPU (General-Purpose computing on Graphics Processing Units) [9], [10]. Orthogonally, there exist several proposals that trade accuracy for faster execution, such as techniques for approximate set similarity or for nearest neighbor search, e.g., [11], [12], [13] (the detailed discussion of related work is deferred to Section 5). Our work deals exclusively with exact set similarity joins.
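
As an illustration of the first approach, the sketch below shows the standard prefix filter for a Jaccard threshold t, in our own simplified rendering rather than the code of the evaluated implementations: once tokens are sorted by a global order (typically increasing frequency), two sets can only reach the threshold if their prefixes of length |r| - ⌈t·|r|⌉ + 1 share at least one token, so pairs with disjoint prefixes are pruned before any exact verification.

    // Simplified prefix filter for a Jaccard threshold t (illustration only).
    // Sets are sorted according to a global token order, e.g. by token frequency.
    #include <cmath>
    #include <cstddef>
    #include <vector>

    using Set = std::vector<int>;   // tokens sorted by the global order

    // Probing prefix length of r for Jaccard threshold t: |r| - ceil(t * |r|) + 1.
    // If Jaccard(r, s) >= t, the prefixes of r and s must share at least one token.
    static std::size_t prefixLength(const Set& r, double t) {
        const std::size_t lr = r.size();
        if (lr == 0) return 0;
        return lr - static_cast<std::size_t>(std::ceil(t * static_cast<double>(lr))) + 1;
    }

    // True if the pair (r, s) survives the prefix filter, i.e. it remains a
    // candidate and still has to be verified by an exact overlap computation.
    static bool prefixFilterSurvives(const Set& r, const Set& s, double t) {
        const std::size_t pr = prefixLength(r, t);
        const std::size_t ps = prefixLength(s, t);
        std::size_t i = 0, j = 0;                 // merge only the two prefixes
        while (i < pr && j < ps) {
            if (r[i] == s[j]) return true;        // shared prefix token: keep pair
            if (r[i] < s[j]) ++i; else ++j;
        }
        return false;                             // disjoint prefixes: safe to prune
    }

    int main() {
        const double t = 0.8;
        Set r = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};            // prefix {1, 2, 3}
        Set s = {3, 11, 12, 13, 14, 15, 16, 17, 18, 19};    // prefix {3, 11, 12}
        // Token 3 appears in both prefixes, so the pair is kept as a candidate
        // even though its actual Jaccard similarity is far below t and would be
        // discarded during verification.
        return prefixFilterSurvives(r, s, t) ? 0 : 1;
    }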

Comparative evaluations of the state-of-the-art techniques for set similarity joins that focus on either a MapReduce or a non-parallel single-machine setting have recently appeared [1], [14]. The goal of our work is to fill the gap and thoroughly evaluate the state-of-the-art exact set similarity join algorithms and techniques in a single-machine setup that can benefit from massive parallelism through the use of a graphics card. Hence, we distinguish our work from [14], where distributed set similarity join algorithms are evaluated. We also differ from [1] in that the techniques we evaluate may exploit massive parallelism.

Modern GPUs offer a highly parallel environment at low cost. As a result, GPGPU has been introduced to accelerate a large variety of applications [15]. In general, GPGPU takes advantage of the different and complementary characteristics of CPUs and GPUs to improve performance. It has been employed in domains such as deep learning, bioinformatics, numerical analytics and many others. However, implementing existing algorithms and techniques on a GPU requires in-depth knowledge of the hardware and is often counter-intuitive. In addition, not all tasks are suitable for GPU-side processing. A traditional CPU excels at complex branching in application logic, whereas a GPU is superior at massively parallel execution of simple tasks and floating point operations [16].

In this work, we perform a comprehensive experimental evaluation of GPU-accelerated set similarity joins against standalone CPU implementations using the framework provided in [1]. Moreover, we examine two alternatives: (i) transferring the whole workload onto the GPU, or (ii) splitting the workload between the CPU and the GPU and assigning the most suitable tasks to each side (a sketch of this co-processing pattern follows the contribution list below). Our findings demonstrate that there is no clear winner among the evaluated techniques; in other words, each alternative has its own sweet spot depending on the data and query characteristics. In summary, the contributions of our work are as follows:

  • To the best of our knowledge, this is the first comprehensive presentation and comparative evaluation of GPU accelerated set similarity joins. In our study, we include the GPU-oriented state-of-the-art techniques.

  • We conduct an extensive performance analysis using eight real-world datasets. We compare our findings against the state-of-the-art CPU and GPGPU implementations. We identify the conditions under which each solution becomes the dominant one, in order to derive a set of guidelines on when to use each technique.

  • We provide a repository with all techniques, so that third-party researchers can repeat and extend our work.1
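
The workload-splitting alternative mentioned above is commonly structured as a small producer-consumer pipeline. The sketch below is a compact, hypothetical illustration of that pattern (ours, not the authors' actual scheme): one thread keeps the CPU busy generating chunks of candidate pairs by filtering, while a second thread drains the chunks and hands each one to the GPU for verification, so both processors stay busy. The chunk layout and the verifyOnGpu stub are illustrative assumptions; in a real system the stub would launch a CUDA verification kernel such as the one sketched in the Background section.

    // Hypothetical CPU-GPU co-processing skeleton (illustration only).
    #include <condition_variable>
    #include <cstdio>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <utility>
    #include <vector>

    using Chunk = std::vector<std::pair<int, int>>;   // candidate (setA, setB) ids

    std::queue<Chunk> pending;          // chunks filtered on the CPU, awaiting the GPU
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    // Stand-in for the GPU side: a real co-processing scheme would copy the chunk
    // to device memory and launch a verification kernel here.
    void verifyOnGpu(const Chunk& chunk) {
        std::printf("verifying %zu candidate pairs on the GPU\n", chunk.size());
    }

    void cpuFilterThread(int numChunks) {
        for (int c = 0; c < numChunks; ++c) {
            Chunk chunk;                               // pretend prefix filtering here
            for (int i = 0; i < 1000; ++i) chunk.push_back({c, i});
            {
                std::lock_guard<std::mutex> lock(m);
                pending.push(std::move(chunk));
            }
            cv.notify_one();                           // wake the GPU-side thread
        }
        { std::lock_guard<std::mutex> lock(m); done = true; }
        cv.notify_one();
    }

    void gpuVerifyThread() {
        while (true) {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [] { return !pending.empty() || done; });
            if (pending.empty() && done) return;
            Chunk chunk = std::move(pending.front());
            pending.pop();
            lock.unlock();
            verifyOnGpu(chunk);    // overlaps with the CPU filtering the next chunk
        }
    }

    int main() {
        std::thread producer(cpuFilterThread, 4);
        std::thread consumer(gpuVerifyThread);
        producer.join();
        consumer.join();
        return 0;
    }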

Paper outline. Next, we give an overview of set similarity joins and provide the basic details of the CUDA programming model. We present the state-of-the-art techniques in Section 3. Our experimental analysis is presented and discussed in Section 4. We provide an overview of the related work in Section 5. We conclude our study and discuss open issues in Section 6.

Section snippets

Background

We introduce the filter-verification framework used by state-of-the-art main memory set similarity join algorithms in line with the comparison work conducted by Mann et al. [1]. We also provide a comprehensive overview of the CUDA programming model, which is proprietary to NVIDIA [17] yet widespread in practice, and explain its main concepts.
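
To connect the filter-verification framework with the CUDA model just outlined, the following self-contained sketch (ours; the flattened data layout, kernel and host code are assumptions rather than the paper's implementation) offloads the verification phase to the GPU: sets are stored as one flat token array with per-set offsets, a grid of thread blocks is launched, and each CUDA thread computes the exact overlap of one candidate pair with a merge-based intersection.

    // Minimal CUDA sketch of GPU-side verification of candidate pairs
    // (illustration only; not taken from the evaluated systems).
    #include <cstdio>
    #include <vector>
    #include <cuda_runtime.h>

    // Tokens of set k occupy tokens[offsets[k] .. offsets[k + 1]) and are sorted.
    __global__ void verifyPairs(const int* tokens, const int* offsets,
                                const int* pairsA, const int* pairsB,
                                int numPairs, float threshold, int* results) {
        int p = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per pair
        if (p >= numPairs) return;

        int a = pairsA[p], b = pairsB[p];
        int i = offsets[a], iEnd = offsets[a + 1];
        int j = offsets[b], jEnd = offsets[b + 1];
        int common = 0;
        while (i < iEnd && j < jEnd) {                   // merge-based overlap
            int x = tokens[i], y = tokens[j];
            common += (x == y);
            i += (x <= y);
            j += (y <= x);
        }
        int sizeA = iEnd - offsets[a], sizeB = jEnd - offsets[b];
        float jaccard = common / float(sizeA + sizeB - common);
        results[p] = (jaccard >= threshold);             // 1 if the pair qualifies
    }

    int main() {
        // Two tiny sets: set 0 = {1,2,3,4}, set 1 = {1,2,3,5}; one candidate pair (0, 1).
        std::vector<int> tokens  = {1, 2, 3, 4, 1, 2, 3, 5};
        std::vector<int> offsets = {0, 4, 8};
        std::vector<int> pa = {0}, pb = {1};

        int *dTok, *dOff, *dPa, *dPb, *dRes;
        cudaMalloc(&dTok, tokens.size() * sizeof(int));
        cudaMalloc(&dOff, offsets.size() * sizeof(int));
        cudaMalloc(&dPa, sizeof(int));
        cudaMalloc(&dPb, sizeof(int));
        cudaMalloc(&dRes, sizeof(int));
        cudaMemcpy(dTok, tokens.data(), tokens.size() * sizeof(int), cudaMemcpyHostToDevice);
        cudaMemcpy(dOff, offsets.data(), offsets.size() * sizeof(int), cudaMemcpyHostToDevice);
        cudaMemcpy(dPa, pa.data(), sizeof(int), cudaMemcpyHostToDevice);
        cudaMemcpy(dPb, pb.data(), sizeof(int), cudaMemcpyHostToDevice);

        // Launch enough 256-thread blocks to cover all candidate pairs.
        int numPairs = 1, block = 256, grid = (numPairs + block - 1) / block;
        verifyPairs<<<grid, block>>>(dTok, dOff, dPa, dPb, numPairs, 0.5f, dRes);

        int qualifies = 0;
        cudaMemcpy(&qualifies, dRes, sizeof(int), cudaMemcpyDeviceToHost);
        std::printf("pair (0, 1) qualifies: %d\n", qualifies);   // prints 1 (Jaccard 0.6)
        cudaFree(dTok); cudaFree(dOff); cudaFree(dPa); cudaFree(dPb); cudaFree(dRes);
        return 0;
    }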

Algorithms & techniques

In this section, we review and discuss the algorithms and techniques that are evaluated in this work. First, we summarize the findings of Mann et al. [1] regarding the three best CPU algorithms, which we consider as our baseline. Second, we present three techniques, each of which employs the GPU to accelerate set similarity joins in a different manner.

Evaluation

The goals of our experimental evaluation are threefold: (i) to show the extent of the speedups achieved by GPU-accelerated implementations over standalone CPU ones, noting that improving on the CPU should not be taken for granted; (ii) to identify the conditions under which each solution becomes the dominant one in practice; and (iii) to provide explanations for the observed behavior.

For ease of presentation, we split the experiments into two parts: the main part, which is adequate to reveal the main

Additional related work

Although extensive research has been carried out on set similarity joins for parallel paradigms such as MapReduce [5], [7], [8], only a few additional studies investigate set similarity joins in the GPGPU paradigm; these, however, focus on approximate solutions, whereas we deal exclusively with exact ones.

An early proposal appeared in [11], in which Lieberman et al. cast the similarity join operation as a GPU sort-and-search problem. First, they create a set of space-filling

Final discussion

This work summarized and thoroughly evaluated the existing state-of-the-art GPU-based exact set similarity joins. We observe that the main techniques have been proposed in the last few years and have different characteristics, which supports the hypothesis that GPU-enabled similarity join is still an evolving technology. More importantly, there is no clear winner, which leaves open the question of whether a globally dominant solution exists. In Section 4.4.3, we summarize the key

Declaration of Competing Interest

One or more of the authors of this paper have disclosed potential or pertinent conflicts of interest, which may include receipt of payment, either direct or indirect, institutional support, or association with an entity in the biomedical field which may be perceived to have potential conflict of interest with this work. For full disclosure statements refer to https://doi.org/10.1016/j.is.2019.101485.

Acknowledgment

The authors gratefully acknowledge the support of NVIDIA Corporation, United States, through the donation of the GPU used in this work via the GPU Grant Program.

References (32)

  • Ribeiro, L.A., et al., Prefix filtering to improve set similarity joins, Inf. Syst. (2011)
  • Mann, W., et al., An empirical evaluation of set similarity join techniques, Proc. VLDB Endow. (2016)
  • Bayardo, R.J., Ma, Y., Srikant, R., Scaling up all pairs similarity search, in: Proceedings of the 16th International...
  • Bouros, P., et al., Spatio-textual similarity joins, PVLDB (2012)
  • Jiang, Y., et al., String similarity joins: An experimental evaluation, PVLDB (2014)
  • Sarma, A.D., et al., ClusterJoin: A similarity joins framework using Map-Reduce, PVLDB (2014)
  • Baraglia, R., et al., Document similarity self-join with MapReduce
  • Vernica, R., et al., Efficient parallel set-similarity joins using MapReduce
  • Metwally, A., et al., V-SMART-Join: A scalable MapReduce framework for all-pair similarity joins of multisets and vectors, PVLDB (2012)
  • Ribeiro-Junior, S., et al., Fast parallel set similarity joins on many-core architectures, J. Inf. Data Manag. (2017)
  • Bellas, C., et al., Exact set similarity joins for large datasets in the GPGPU paradigm
  • Lieberman, M.D., et al., A fast similarity join algorithm using graphics processing units
  • Cruz, M.S., et al., GPU acceleration of set similarity joins
  • Johnson, J., et al., Billion-scale similarity search with GPUs (2017)
  • Fier, F., et al., Set similarity joins on MapReduce: An experimental survey, Proc. VLDB Endow. (2018)
  • Keckler, S.W., et al., GPUs and the future of parallel computing, IEEE Micro (2011)