1 Introduction

Given a 1D discrete or continuous domain, an interval is defined by a starting and an ending point in this domain. Consider, for example, the domain of all non-negative integers \(\mathbb {N}\); two integers \(\mathsf {start},\mathsf {end} \in \mathbb {N}\), with \(\mathsf {start}\le \mathsf {end}\), define an interval \(i = [\mathsf {start},\mathsf {end}]\) as the subset of \(\mathbb {N}\) that includes all integers x with \(\mathsf {start}\le x\le \mathsf {end}\). Let R, S be two collections of intervals. The interval join \(R\bowtie S\) is defined by all pairs of intervals \(r\in R\), \(s\in S\) that intersect, i.e., \(r.\mathsf {start}\le s.\mathsf {start}\le r.\mathsf {end}\) or \(s.\mathsf {start}\le r.\mathsf {start}\le s.\mathsf {end}\).
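For illustration, the two-case condition above collapses into a single symmetric test; the following minimal C++ sketch (with hypothetical type and function names, not taken from any of the cited implementations) shows it for closed intervals over a discrete domain.

```cpp
#include <cstdint>

// Hypothetical closed interval over a discrete domain.
struct Interval {
    uint64_t start;
    uint64_t end;   // invariant: start <= end
};

// r and s intersect iff neither ends before the other starts; this is
// equivalent to the two-case definition of the interval join predicate.
inline bool intersects(const Interval& r, const Interval& s) {
    return r.start <= s.end && s.start <= r.end;
}
```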

The interval join is one of the most widely used operations in temporal databases [16]. Generally speaking, temporal databases store relations of explicit attributes that conform to a schema and each tuple carries a validity interval. In this context, an interval join would find pairs of tuples from two relations which have intersecting validity. For example, assume that the employees of a company may be employed in different departments during different time periods. Given the employees in Fig. 1 who have worked in departments A (red) and B (blue), the interval join would find pairs of employees whose periods of work in A and B, respectively, overlap.

Fig. 1: Motivation example in temporal databases

Interval joins find application in other domains as well. In multidimensional spaces, an object can be represented as a set of intervals from a space-filling curve. The intervals correspond to the subsequences of points on the curve that are included in the object. Spatial joins can then be reduced to interval joins in the space-filling curve representation [22]. The filter-step of spatial joins between sets of objects approximated by minimum bounding rectangles (MBRs) can also be processed by finding intersecting pairs in one dimension (i.e., an interval join) and verifying the intersection in the other dimension on-the-fly [2, 7, 36]. Another application is uncertain data management. Uncertain values are represented as intervals (which can be paired with confidence values). Thus, equi-joins on the uncertain attributes of two relations translate to interval joins [11].

Most of the previous works on interval joins [13, 15, 18, 32, 34] assume that the input data reside on disk and their objective is to minimize I/O accesses during the join. Such a setting becomes less relevant given contemporary in-memory data management and the wide availability of parallel and distributed platforms and models. Hence, the classic plane sweep (PS) algorithm [31] for in-memory join evaluation has not been the focus of most previous work. A recent paper [29] proposed an optimized PS algorithm (taken from [2]), called Endpoint-Based Interval (\(\mathsf {EBI}\)) Join. \(\mathsf {EBI}\) sorts the endpoints of all intervals (from both R and S) and then sweeps a line which stops at each of the sorted endpoints. As the line sweeps, \(\mathsf {EBI}\) maintains the active sets of intervals from R and S which intersect the current stop point of the line, and uses them to output the join results.

The work of [29] focused on minimizing the random memory accesses due to the updates and scans of the active sets. To this end, a special data structure called gapless hash map was proposed. However, random accesses can be avoided altogether by another implementation of PS, presented in [7] for MBR (i.e., spatial) joins. We call this version forward scan (\(\mathsf {FS}\)) based PS. In a nutshell, \(\mathsf {FS}\) sweeps all intervals in increasing order of their \(\mathsf {start}\) points. For each interval encountered (e.g., \(r \in R\)), \(\mathsf {FS}\) scans forward the list of intervals from the other set (e.g., S). All such intervals whose \(\mathsf {start}\) point is before the \(\mathsf {end}\) of r form join results with r. The cost of \(\mathsf {FS}\) (excluding sorting) is \(O(|R|+|S|+|R\bowtie S|)\), where \(|R \bowtie S|\) is the number of join results.

Contributions In this work, we investigate the in-memory computation of interval joins, taking advantage of the parallel processing offered by modern multi-core hardware. Our contributions are twofold. First, we study the single-threaded computation of interval joins, by presenting four novel optimizations for the \(\mathsf {FS}\) algorithm, which greatly reduce its computational cost. In particular, optimized \(\mathsf {FS}\) manages to produce multiple join tuples in batch at the cost of a single comparison, or even to output some results with zero comparisons. The performance of \(\mathsf {FS}\) is further enhanced by careful storage of the intervals in main memory, which reduces cache misses. Overall, we achieve performance competitive with or better than the state-of-the-art PS algorithm (\(\mathsf {EBI}\) [29]), without using any special data structures.

Second, we study the in-memory parallel computation of interval joins. We investigate two approaches that differ on whether they physically partition the inputs. Our no-partitioning method operates in a master-slaves manner; the master CPU thread sweeps input intervals, while slave threads perform independent forward scans in parallel. For partitioning-based parallel processing, we first show the limitations of the hash-based partitioning framework from [29]. Then, we propose a novel, domain-based partitioning instead. Although intervals must be replicated across the domain partitions to ensure correctness, we show that duplicate results can be avoided; therefore, the partition-join jobs can become completely independent. To minimize the number of comparisons and also achieve load balancing, we break down each partition-join into five independent mini-join jobs with varying costs; in practice, only one of these mini-joins has the complexity of the original join problem, while the others have a significantly lower cost. We show how to schedule these mini-joins to the available CPU threads. To improve the cost balancing between the partition-joins, we also suggest an adaptive splitting approach. Finally, we present and evaluate three strategies for the partitioning phase which benefit from modern hardware. Our experimental analysis shows that the domain-based partitioning framework, after employing all the proposed optimizations, achieves high speedup with the number of threads, greatly outperforming both the hash-based partitioning framework of [29] and the no-partitioning approach.

Comparison to our previous work This article significantly extends a preliminary version of our work [5] in a number of directions. First, we design two additional optimization techniques for \(\mathsf {FS}\) which further boost its performance. All optimizations are thoroughly evaluated, including new experiments to provide better insights. Second, we provide a rule of thumb that decides which optimizations to apply, based on the characteristics of the join inputs. Accordingly, we devise \(\mathsf {optFS}\), a self-tuning version of \(\mathsf {FS}\), which automatically selects and applies the most appropriate optimizations. Third, we present a specialized version of \(\mathsf {FS}\) for interval self-joins, i.e., when we seek overlapping pairs of intervals in a single collection. Fourth, we discuss and evaluate a new approach for parallel processing which does not physically partition the inputs. Fifth, we investigate alternative strategies for the partitioning phase of the join. Finally, we conduct new tests to investigate the best setup for each parallel processing framework.

Outline The rest of the article is organized as follows. First, Sect. 2 discusses related work, while Sect. 3 reviews plane sweep methods in more detail, namely \(\mathsf {EBI}\) [29] and the original \(\mathsf {FS}\) [7]. Then, we discuss the single-threaded join evaluation. Section 4 details our optimizations for \(\mathsf {FS}\), Sect. 5 discusses self-joins, and Sect. 6 presents our experimental analysis which demonstrates the effect of our \(\mathsf {FS}\) optimizations. Next, we discuss the parallel computation of interval joins. Section 7 presents two novel parallel techniques, termed no-partitioning and domain-based partitioning, Sect. 8 details our strategies for parallelizing the partitioning phase, and Sect. 9 presents the second part of our experiments that demonstrates the efficiency of our parallel interval join framework. Last, Sect. 10 concludes the paper.

2 Related work

We classify previous works based on the data structures they use and on the underlying architecture.

Nested loops and merge join Early work on interval joins [18, 32] studied a temporal join problem, where two relations are equi-joined on a non-temporal attribute and the temporal overlaps of joined tuple pairs should also be identified. Techniques based on nested-loops (for unordered inputs) and on sort-merge join (for ordered inputs) were proposed, as well as specialized data structures for append-only databases. Similar to plane sweep, merge join algorithms require the two input collections to be sorted, but join computation is sub-optimal compared to \(\mathsf {FS}\), which guarantees at most \(|R|+|S|\) comparisons that do not produce results.

Index-based algorithms Enderle et al. [15] propose interval join algorithms, which operate on two RI-trees [23] that index the input collections. Zhang et al. [37] focus on finding pairs of records in a temporal database that intersect in the (key, time) space (i.e., a problem similar to that studied in [18, 32]), proposing an extension of the multi-version B-tree [3].

Partitioning-based algorithms A partitioning-based approach for interval joins was proposed in [34]. The domain is split into disjoint ranges. Each interval is assigned to the partition corresponding to the last domain range it overlaps. The domain ranges are processed sequentially from last to first; after the last pair of partitions is processed, the intervals which overlap the previous domain range are migrated to the next join. This way, data replication is avoided. Histogram-based techniques for defining good partition boundaries were proposed in [33]. A more sophisticated partitioning approach, called Overlap Interval Partitioning (\(\mathsf {OIP}\)) Join [13], divides the domain into equal-sized granules and consecutive granules define the ranges of the partitions. Each interval is assigned to the partition corresponding to the smallest sequence of granules that contains it. In the join phase, partitions of one collection are joined with their overlapping partitions from the other collection. \(\mathsf {OIP}\) was shown to be superior to index-based approaches [15] and sort-merge join. These results are consistent with the comparative study of [16], which shows that partitioning-based methods are superior to nested loops and merge join approaches.

Disjoint Interval Partitioning (\(\mathsf {DIP}\)) [8] was recently proposed for temporal joins and other sort-based operations on interval data (e.g., temporal aggregation). The main idea behind \(\mathsf {DIP}\) is to divide each of the two input relations into partitions, such that each partition contains only disjoint intervals. Every partition of one input is then joined with all partitions of the other. Since intervals in the same partition do not overlap, sort-merge computations are performed without backtracking. Prior to this work, temporal aggregation was studied in [26]. Given a large collection of intervals (possibly associated with values), the objective is to compute an aggregate (e.g., count the valid intervals) at all points in time. An algorithm was proposed in [26] which divides the domain into partitions (buckets), assigns the intervals to the first and last bucket they overlap and maintains a meta-array structure for the aggregates of buckets entirely covered by intervals. The aggregation can then be processed independently for each bucket (e.g., using a sort-merge based approach) and the algorithm can be parallelized in a shared-nothing architecture. We also propose a domain-partitioning approach for parallel processing (Sect. 7), but the details differ due to the different natures of temporal join and aggregation.

Methods based on plane sweep The Endpoint-Based Interval (\(\mathsf {EBI}\)) Join [29] (reviewed in Sect. 3.1) and its lazy version \(\mathsf {LEBI}\) were shown to significantly outperform \(\mathsf {OIP}\) [13] and to also be superior to another plane sweep implementation [2]. An approach similar to \(\mathsf {EBI}\) is used in SAP HANA [21]. To our knowledge, no previous work was compared to \(\mathsf {FS}\) [7] (detailed in Sect. 3.2). In Sect. 4, we propose four optimizations for \(\mathsf {FS}\) that greatly improve its performance, making it competitive or even faster than \(\mathsf {LEBI}\). Last, extensions and applications of the plane sweep approach have been discussed in [6, 10], but in the context of temporal aggregation and SPARQL query processing, respectively.

Parallel algorithms A domain-based partitioning strategy for interval joins on multi-processor machines was proposed in [24]. Each partition is assigned to a processor and intervals are replicated to the partitions they overlap, to allow join results to be produced independently at each processor. At the end, a merge phase with duplicate elimination is required, as the same join result can be produced by different processors. Duplicates can be avoided using the reference test from [14], but this approach incurs extra comparisons. Our parallel processing approach in Sect. 7 also applies a domain-based partitioning but produces no duplicates. Also, we propose a breakdown of each partition join into a set of mini-join jobs, which has never been considered in previous work.

Distributed algorithms Distributed interval joins were studied in [22]. The goal is to join sets of intervals located at different clients. The clients iteratively exchange statistics with the server, which help the latter to compute a coarse-level approximate join; exact results are refined by on-demand communication with the clients. Chawda et al. [9] implement the partitioning algorithm of [24] in the MapReduce framework and extend it to operate for other (non-overlap) join predicates. The main goal of distributed algorithms is to minimize the communication cost between the machines that hold the data and compute the join.

3 Plane sweep for interval joins

This section presents the necessary background on plane sweep based computation of interval joins. First, we detail the \(\mathsf {EBI}\) algorithm [29]. Then, we review the forward scan based algorithm from [7], which has been overlooked by previous work. Both methods take as input collections R, S of intervals and compute all (r, s) pairs with \(r\in R, s\in S\) that intersect. We denote by \(r.\mathsf {start}\) (\(r.\mathsf {end}\)) the starting (ending) endpoint of an interval r.

3.1 Endpoint-Based Interval Join

Algorithm 1: Endpoint-Based Interval Join (\(\mathsf {EBI}\))

\(\mathsf {EBI}\) [29] is based on the internal-memory plane sweep technique of [31], but tailored to modern hardware. Algorithm 1 illustrates the pseudo-code of \(\mathsf {EBI}\). \(\mathsf {EBI}\) represents each input interval, e.g., \(r \in R\), by two tuples in the form of \(\langle \mathsf {endpoint},\mathsf {type},\mathsf {id}\rangle \), where \(\mathsf {endpoint}\) equals either \(r.\mathsf {start}\) or \(r.\mathsf {end}\), \(\mathsf {type}\) flags whether \(\mathsf {endpoint}\) is a starting or an ending endpoint, and \(\mathsf {id}\) is the identifier of r. These tuples are stored inside the endpoint indices \(EI^R\) and \(EI^S\), sorted primarily by their \(\mathsf {endpoint}\) and secondarily by \(\mathsf {type}\). To compute the join, \(\mathsf {EBI}\) concurrently scans the endpoint indices, accessing their tuples in increasing global order of their sorting key, simulating a “sweep line” that stops at each endpoint from either R or S. At each position of the sweep line, \(\mathsf {EBI}\) keeps track of the intervals that have started but not finished, i.e., the index tuples that are \(\mathsf {start}\) endpoints, for which the index tuple having the corresponding \(\mathsf {end}\) endpoint has not been accessed yet. Such intervals are called active and they are stored inside sets \(A^R\) and \(A^S\); \(\mathsf {EBI}\) updates these active sets depending on the \(\mathsf {type}\) entry of the current index tuple (Lines 10 and 14 for collection R and Lines 19 and 23 for S). Finally, for a current index tuple (e.g., \(e^R\)) of \(\mathsf {type}\) START, the algorithm iterates through the active intervals of the opposite input (e.g., \(A^S\) on Lines 11–12) to produce the next batch of results (e.g., the intervals of S that join with \(e^R.\mathsf {id}\)).

By recording the active intervals from each collection, \(\mathsf {EBI}\) can directly report the join results without any endpoint comparisons. To achieve this, the algorithm needs to store and scan the endpoint indices, which contain twice as many entries as the input collections. Hence, excluding the sorting cost for \(EI^R\) and \(EI^S\), \(\mathsf {EBI}\) conducts \(2\cdot (|R|+|S|)\) endpoint comparisons to advance the sweep line, in total. However, the critical overhead of \(\mathsf {EBI}\) is the maintenance and scanning of the active sets at each loop; i.e., Lines 10 and 19 (add), Lines 11–12 and 20–21 (scan), Lines 14 and 23 (remove). This overhead can be quite high; for example, typical hash map data structures support efficient O(1) updates but scanning their contents is slow. To deal with this issue, Piatov et al. designed a special hash table, termed the gapless hash map, which efficiently supports all three insert, remove and getNext operations. Finally, the authors further optimized the join computation by proposing a lazy evaluation technique which buffers consecutive index tuples of \(\mathsf {type}\) START (and hence, their corresponding intervals) as long as they originate from the same input (e.g., R). When producing the join results, a single scan over the active set of the opposite collection (e.g., \(A^S\)) is performed for the entire buffer. This idea is captured by the Lazy Endpoint-Based Interval (\(\mathsf {LEBI}\)) Join algorithm. By keeping the buffer size small enough to fit inside the L1 cache or even the CPU registers, \(\mathsf {LEBI}\) greatly reduces main memory cache misses and hence, outperforms \(\mathsf {EBI}\) even more.
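To make the sweep concrete, the following C++ sketch captures the endpoint-based sweep described above under simplifying assumptions: a single merged endpoint list replaces the two endpoint indices and std::unordered_set stands in for the gapless hash map; all names are hypothetical and this is not the evaluated implementation.

```cpp
#include <algorithm>
#include <cstdint>
#include <unordered_set>
#include <utility>
#include <vector>

struct EndpointTuple {
    uint64_t endpoint;   // r.start or r.end
    bool     isStart;    // type: START or END
    uint32_t id;         // interval identifier
    bool     fromR;      // originating collection
};

void ebiSweep(std::vector<EndpointTuple> EI,
              std::vector<std::pair<uint32_t, uint32_t>>& out) {
    // Sort primarily by endpoint; on ties, process START before END so that
    // touching intervals are reported, matching the closed-interval semantics.
    std::sort(EI.begin(), EI.end(), [](const EndpointTuple& a, const EndpointTuple& b) {
        if (a.endpoint != b.endpoint) return a.endpoint < b.endpoint;
        return a.isStart > b.isStart;
    });

    std::unordered_set<uint32_t> activeR, activeS;   // active sets A^R and A^S
    for (const EndpointTuple& e : EI) {
        auto& own   = e.fromR ? activeR : activeS;
        auto& other = e.fromR ? activeS : activeR;
        if (e.isStart) {
            own.insert(e.id);
            for (uint32_t id : other)                // join with every active interval
                out.emplace_back(e.fromR ? e.id : id, e.fromR ? id : e.id);
        } else {
            own.erase(e.id);                         // interval is no longer active
        }
    }
}
```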

3.2 Forward scan-based plane sweep

The experiments in [29] showed that \(\mathsf {LEBI}\) outperforms not only \(\mathsf {EBI}\), but also the plane sweep algorithm of [2], which directly scans the inputs ordered by \(\mathsf {start}\) endpoint and keeps track of the active intervals in a linked list. Intuitively, both approaches perform a backward scan, i.e., a scan of already encountered intervals, organized in a data structure that supports scans and updates. In practice, however, the need to implement a special structure may limit the applicability and the adoption of these evaluation approaches, while also increasing the memory space requirements.

Algorithm 2: Forward scan-based plane sweep (\(\mathsf {FS}\))

In [7], Brinkhoff et al. presented a different implementation of plane sweep, which performs a forward scan directly on the input collections; hence, (i) there is no need to keep track of active sets in a special data structure and (ii) data scans are conducted sequentially. Algorithm 2 illustrates the pseudo-code of this method, denoted by \(\mathsf {FS}\). First, both inputs are sorted by the \(\mathsf {start}\) endpoint of each interval. Then, \(\mathsf {FS}\) sweeps a line, which stops at the \(\mathsf {start}\) endpoint of all intervals of R and S in order. For each position of the sweep line, corresponding to the start of an interval, say \(r \in R\), the algorithm produces join results by combining r with all intervals from the opposite collection that start (i) after the sweep line and (ii) before \(r.\mathsf {end}\), i.e., all \(s' \in S\) with \(r.\mathsf {start} \le s'.\mathsf {start} \le r.\mathsf {end}\) (internal while-loops on Lines 7–10 and 13–16). Excluding the cost of sorting R and S, \(\mathsf {FS}\) conducts \(|R|+|S|+|R\bowtie S|\) endpoint comparisons, in total. Specifically, each interval \(r\in R\) (the case for S is symmetric) is compared to just one \(s'\in S\) that does not intersect it, namely the comparison that terminates the loop at Lines 8–10.
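A minimal C++ sketch of this forward scan is given below, assuming hypothetical names (the Interval struct from the introduction is repeated for self-containment) and reporting results as index pairs over the sorted inputs; it is not the code evaluated in our experiments.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

struct Interval { uint64_t start, end; };

// Forward scan-based plane sweep: both inputs are sorted by start, the sweep
// line stops at every start endpoint, and a forward scan over the opposite
// collection reports all intervals starting before the current end.
void forwardScanJoin(std::vector<Interval> R, std::vector<Interval> S,
                     std::vector<std::pair<size_t, size_t>>& out) {
    auto byStart = [](const Interval& a, const Interval& b) { return a.start < b.start; };
    std::sort(R.begin(), R.end(), byStart);
    std::sort(S.begin(), S.end(), byStart);

    size_t i = 0, j = 0;
    while (i < R.size() && j < S.size()) {
        if (R[i].start <= S[j].start) {
            // Sweep line stops at R[i]: scan S forward while it starts before R[i].end.
            for (size_t k = j; k < S.size() && S[k].start <= R[i].end; ++k)
                out.emplace_back(i, k);
            ++i;
        } else {
            // Symmetric case: sweep line stops at S[j] and scans R forward.
            for (size_t k = i; k < R.size() && R[k].start <= S[j].end; ++k)
                out.emplace_back(k, j);
            ++j;
        }
    }
}
```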

4 Optimizing \(\mathsf {FS}\)

We present four optimization techniques for \(\mathsf {FS}\) that can greatly enhance its performance. Naturally, the cost of \(\mathsf {FS}\) cannot be asymptotically reduced; \(|R| + |S|\) endpoint comparisons is the unavoidable cost of advancing the sweep line. However, it is possible to reduce the number of \(|R\bowtie S|\) comparisons required to produce the join results, which is the focus of the first two optimization techniques, termed grouping and bucket indexing. In addition, low-level code engineering and careful data layout in main memory can further improve the running time of \(\mathsf {FS}\), which is the focus of our enhanced loop unrolling and decomposed data layout techniques.

4.1 Grouping

The intuition behind our first optimization technique is to group consecutively swept intervals from the same collection and produce join results for them in batch, avoiding redundant comparisons. We exemplify this idea in Fig. 2, which depicts intervals \(\{r_1,r_2\} \subseteq R\) and \(\{s_1,s_2,s_3,s_4,s_5\} \subseteq S\) sorted by \(\mathsf {start}\) endpoint. Assume that \(\mathsf {FS}\) has already examined \(s_1\); since \(r_1.\mathsf {start} < s_2.\mathsf {start}\), the next interval where the sweep line stops is \(r_1\). Algorithm 2 (Lines 7–10) then forwardly scans through the shaded area in Fig. 2a from \(s_2.\mathsf {start}\) until it reaches \(s_5.\mathsf {start}>r_1.\mathsf {end}\), producing result pairs \(\{(r_1,s_2), (r_1,s_3), (r_1,s_4)\}\). The next stop of the sweep line is \(r_2.\mathsf {start}\), since \(r_2.\mathsf {start} < s_2.\mathsf {start}\). \(\mathsf {FS}\) scans through the shaded area in Fig. 2b producing results \(\{(r_2,s_2), (r_2,s_3)\}\). We observe that the scanned areas of \(r_1\) and \(r_2\) are not disjoint, which in practice means that \(\mathsf {FS}\) performed redundant endpoint comparisons. Indeed, this is the case for \(s_2.\mathsf {start}\) and \(s_3.\mathsf {start}\), which were compared to both \(r_1.\mathsf {end}\) and \(r_2.\mathsf {end}\). However, since \(r_1.\mathsf {end} > r_2.\mathsf {end}\) holds, \(r_2.\mathsf {end} > s_2.\mathsf {start}\) automatically implies that \(r_1.\mathsf {end} > s_2.\mathsf {start}\); therefore, pairs \((r_1,s_2)\), \((r_2,s_2)\) could have been reported by comparing only \(r_2.\mathsf {end}\) to \(s_2.\mathsf {start}\). Hence, processing consecutively swept intervals from the same collection (e.g., \(r_1\) and \(r_2\)) as a group allows us to scan their common areas only once.

Fig. 2: Scanned areas by \(\mathsf {FS}\), \(\mathsf {gFS}\), \(\mathsf {bFS}\) and \(\mathsf {bgFS}\) for \(r_1\) and \(r_2\); with grouping, \(r_2\) precedes \(r_1\). Underlined result pairs are produced without any endpoint comparisons

Algorithm 3: \(\mathsf {FS}\) with grouping (\(\mathsf {gFS}\))

Algorithm 3 illustrates the pseudo-code of \(\mathsf {gFS}\), which enhances \(\mathsf {FS}\) with the grouping optimization. Instead of processing a single interval at a time, \(\mathsf {gFS}\) considers a group of consecutive intervals from the same collection at a time. Specifically, assume that at the current loop iteration \(r.\mathsf {start}<s.\mathsf {start}\) (the other case is symmetric). Starting from r, \(\mathsf {gFS}\) accesses all \(r'\in R\) with \(r'.\mathsf {start}<s.\mathsf {start}\) (Line 7) and puts them in a group \(G^R\). Next, the contents of \(G^R\) are reordered by increasing \(\mathsf {end}\) endpoint (Line 8). Then, \(\mathsf {gFS}\) initiates a forward scan on S starting from \(s'=s\) (Lines 9–14), but, unlike \(\mathsf {FS}\), the scan is done only once for all intervals in \(G^R\). For each \(r_i \in G^R\) in the new order, if \(s'.\mathsf {start}\le r_i.\mathsf {end}\), then \(s'\) intersects not only \(r_i\) but also all intervals in \(G^R\) after \(r_i\) (due to the sorting of \(G^R\) by \(\mathsf {end}\)). If \(s'.\mathsf {start}> r_i.\mathsf {end}\), then \(s'\) does not join with \(r_i\) but may join with succeeding intervals in \(G^R\), so the for loop proceeds to the next \(r_i\in G^R\).

Figure 2c, d exemplify \(\mathsf {gFS}\) for intervals \(r_1\) and \(r_2\) grouped under \(G^R\); as \(r_1.\mathsf {end} > r_2.\mathsf {end}\), \(r_2\) is considered first. When the shaded area in Fig. 2c from \(s_2.\mathsf {start}\) until \(s_4.\mathsf {start}\) is scanned, \(\mathsf {gFS}\) produces results that pair both \(r_2\) and \(r_1\) with covered intervals \(s_2\) and \(s_3\) from S, by comparing \(s_2.\mathsf {start}\) and \(s_3.\mathsf {start}\) only to \(r_2.\mathsf {end}\). Intuitively, avoiding redundant endpoint comparisons corresponds to removing the overlap between the scanned areas of consecutive intervals; compare \(r_1\)’s scanned area by \(\mathsf {gFS}\) in Fig. 2d to the area in Fig. 2b by \(\mathsf {FS}\) after removing the overlap with \(r_2\)’s area.
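The following C++ sketch illustrates one possible arrangement of such a grouped forward scan, under the assumptions that the group has already been formed and re-sorted by \(\mathsf {end}\) endpoint and that S is sorted by \(\mathsf {start}\); all names are hypothetical, and the full Algorithm 3 additionally forms the groups while advancing the sweep line.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

struct Interval { uint64_t start, end; };

// One grouped forward scan: G holds consecutive R-intervals re-sorted by
// ascending end, S is sorted by start, and j is the first S-position that
// has not been swept yet. Results are (group position, S position) pairs.
void groupForwardScan(const std::vector<Interval>& G,
                      const std::vector<Interval>& S, size_t j,
                      std::vector<std::pair<size_t, size_t>>& out) {
    size_t k = j;
    for (size_t g = 0; g < G.size(); ++g) {
        // Compare S only against the smallest not-yet-finished end in the group.
        while (k < S.size() && S[k].start <= G[g].end) {
            // S[k] joins G[g] and, since the group is sorted by end, every
            // later group member as well -- reported without extra comparisons.
            for (size_t h = g; h < G.size(); ++h)
                out.emplace_back(h, k);
            ++k;
        }
        // S[k] starts after G[g].end: G[g] is finished, but longer group
        // members may still join with S[k], so continue without resetting k.
    }
}
```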

Discussion and implementation details The grouping technique of \(\mathsf {gFS}\) differs from the buffering employed by \(\mathsf {LEBI}\) [29]. First, \(\mathsf {LEBI}\) groups consecutive \(\mathsf {start}\) endpoints in a sort order that includes 4 sets of items (the \(\mathsf {start}\) and \(\mathsf {end}\) endpoints of both inputs), whereas in \(\mathsf {gFS}\) there are only 2 sets of items (i.e., only the \(\mathsf {start}\) endpoints of the two collections). As a result, the groups in \(\mathsf {gFS}\) are likely to be larger than \(\mathsf {LEBI}\)'s buffer (and larger groups make \(\mathsf {gFS}\) more efficient). Second, the buffer in \(\mathsf {LEBI}\) is solely employed for outputting results, while groups in \(\mathsf {gFS}\) also facilitate the avoidance of redundant endpoint comparisons due to the reordering of groups by \(\mathsf {end}\) endpoint.

Regarding the implementation of grouping in \(\mathsf {gFS}\), we experimented with two different approaches. In the first approach, each group is copied to and managed in a dedicated array in main memory. The second approach retains pointers to the first and last index of each group in the corresponding collection; the segment of the collection corresponding to the group is re-sorted (note that correctness is not affected by this). Our tests showed that the first approach is always faster, due to the reduction of cache misses during the multiple scans of the group (i.e., Lines 12–13 and 22–23).

4.2 Bucket indexing

Our second optimization technique extends \(\mathsf {FS}\) to avoid even more endpoint comparisons during the computation of the join results. The idea is as follows. First, we split the domain into a predefined number of equally sized disjoint stripes; all intervals from R (resp. S) that start within a particular stripe are stored inside a dedicated bucket of the \(BI^R\) (resp. \(BI^S\)) bucket index. Figure 3 exemplifies the domain stripes and the bucket indices for the interval collections of Fig. 2.

Fig. 3: Bucket indexing: domain stripes and \(BI^R\), \(BI^S\) bucket indices for the intervals of Fig. 2

Algorithm 4: \(\mathsf {FS}\) with bucket indexing (\(\mathsf {bFS}\))

With the bucket indices, the area scanned by \(\mathsf {FS}\) for an interval is entirely covered by a range of stripes. Consider Fig. 2c, e; \(r_1\)’s scanned area lies inside four stripes, which means that the involved intervals from S start between the \(BI^S\) bucket covering \(s_2.\mathsf {start}\) and the \(BI^S\) bucket covering \(r_1.\mathsf {end}\). In this spirit, area scanning resembles a range query over the bucket indices. Hence, every interval \(s_i\) from a bucket completely inside \(r_1\)’s scanned area, or lying after \(s_2\) in the first bucket, can be paired with \(r_1\) as a join result without any endpoint comparisons; by definition of the stripes/buckets, \(s_i.\mathsf {start} \le r_1.\mathsf {end}\) holds for such intervals. So, we only need to conduct endpoint comparisons for the \(s_i\) intervals from the bucket that covers \(r_1.\mathsf {end}\). This distinction is graphically shown in Fig. 2e, f where solid gray areas are used to directly produce join results with no endpoint comparisons. Observe that, for this example, both join results produced when \(\mathsf {FS}\) performs a forward scan for \(r_2\) are directly reported when using the bucket indexing. On the other hand, bucket indexing enables us to directly report only two of the three join results for \(r_1\), as the bucket that contains \(s_4\) is not completely inside \(r_1\)’s scanned area.

Algorithm 4 illustrates the pseudo-code of \(\mathsf {bFS}\), which enhances \(\mathsf {FS}\) with bucket indexing. Essentially, \(\mathsf {bFS}\) operates similarly to \(\mathsf {FS}\). Their main difference lies in the forward scan for the current interval. Without loss of generality, consider \(r \in R\) (the case of \(s \in S\) is symmetric); Lines 8–14 implement the range query discussed in the previous paragraph. The algorithm first identifies the bucket \(B \in BI^S\) which covers \(r.\mathsf {end}\). Then, it iterates through the intervals \(s' \in S\) after the current s that originate from the buckets before B, directly producing join results without any endpoint comparisons (Lines 9–11); finally, the intervals of B are scanned and compared exactly as in \(\mathsf {FS}\) (Lines 12–14).

Discussion and implementation details In our implementation, we choose not to materialize the index buckets, i.e., no intervals are copied to dedicated data structures. Instead, we store for each bucket a pointer to the last interval in it; this allows \(\mathsf {bFS}\) to efficiently perform the forward scans. With this design, we guarantee a small main memory footprint for our method, as there is practically no need to store a second copy of the data.
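A minimal C++ sketch of one such bucket-indexed forward scan is shown below, assuming hypothetical names and a pointer-style index where bucketEnd[b] is the position in S one past the last interval starting in stripe b; it is a sketch of the technique, not the evaluated implementation.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Interval { uint64_t start, end; };

// Bucket-indexed forward scan for interval r against S (sorted by start),
// starting at position j. Stripes are equally sized, beginning at domainStart
// with width stripeWidth; bucketEnd[b] is one past the last S-interval whose
// start falls into stripe b. Results are reported as positions in S.
void bucketForwardScan(const Interval& r,
                       const std::vector<Interval>& S, size_t j,
                       const std::vector<size_t>& bucketEnd,
                       uint64_t domainStart, uint64_t stripeWidth,
                       std::vector<size_t>& out) {
    // Bucket covering r.end (clamped to the indexed domain).
    size_t b = std::min<size_t>((r.end - domainStart) / stripeWidth,
                                bucketEnd.size() - 1);
    // Every S-interval before bucket b starts at or before r.end by construction.
    size_t safe = (b == 0) ? j : std::max(j, bucketEnd[b - 1]);

    for (size_t k = j; k < safe; ++k)         // direct results, zero comparisons
        out.push_back(k);
    for (size_t k = safe;                     // last bucket: compare as in FS
         k < S.size() && S[k].start <= r.end; ++k)
        out.push_back(k);
}
```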

4.3 Enhanced loop unrolling

Our third optimization builds upon a code transformation technique known as loop unrolling or loop unwinding [1, 27, 28]. Essentially, the goal of loop unrolling is to reduce the execution time by (i) eliminating the overhead of controlling a loop (i.e., checking its exit condition) and the latency due to main memory accesses, and (ii) reducing branch penalties. Such a transformation can be carried out either manually by the programmer or automatically by the compiler; our focus is on the former case.

The idea of manual unrolling involves re-writing the loop as a repeated sequence of similar independent statements. For example, a loop which processes the 1000 elements of an array can be modified to perform only 100 iterations using a so-called unrolling factor of 10; i.e., every iteration of the new loop executes 10 identical and independent element processing statements. In this spirit, a straightforward way to benefit from loop unrolling would be to unfold the forward scan loop on Lines 7–9 of Algorithm 2 (the case of Lines 13–15 is symmetric) by a factor of x. Under this, the exit condition \(s' \ne null\) would be checked only once for every x-th interval. Also, every iteration of the new loop checks the \(r.\mathsf {end} \ge s'.\mathsf {start}\) overlap condition for each of the next x \((r,s')\) pairs and outputs a pair whenever the condition holds.

Despite its positive effect on reducing the loop cost, this straightforward approach would still incur the same number of endpoint comparisons as the forward scan of \(\mathsf {FS}\), because the \(r.\mathsf {end} \ge s'.\mathsf {start}\) condition is checked for every reported pair. In view of this, we propose an adaptation, termed enhanced loop unrolling, which skips endpoint comparisons to accelerate \(\mathsf {FS}\). Specifically, instead of checking \(r.\mathsf {end} \ge s'.\mathsf {start}\) for every \((r,s')\) pair, we check whether this condition holds for the x-th \(s'\). If so, all x intervals are guaranteed to pair with the current interval r, the x pairs are reported without the need for any comparisons, and we proceed to the next x intervals. Otherwise (i.e., if \(r.\mathsf {end} < s'.\mathsf {start}\)), the x-th \(s'\) interval does not overlap r and therefore, we fall back to scanning the preceding \(x-1\) intervals one by one, as in \(\mathsf {FS}\). We denote by \(\mathsf {uFS}\) the extension of \(\mathsf {FS}\) which employs the enhanced loop unrolling optimization.
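The following C++ sketch captures this enhanced unrolling for one forward scan, assuming hypothetical names and a small factor U for readability (our experiments use a factor of 32); a real implementation would unroll the reporting loop explicitly rather than iterating over it.

```cpp
#include <cstdint>
#include <vector>

struct Interval { uint64_t start, end; };

constexpr size_t U = 4;   // unrolling factor; 32 in the experiments

// Enhanced-unrolled forward scan for interval r against S (sorted by start),
// starting at position j. Results are reported as positions in S.
void unrolledForwardScan(const Interval& r,
                         const std::vector<Interval>& S, size_t j,
                         std::vector<size_t>& out) {
    size_t k = j;
    // Jump U intervals at a time: one comparison yields U results in batch.
    while (k + U <= S.size() && S[k + U - 1].start <= r.end) {
        for (size_t h = 0; h < U; ++h)   // all U candidates are guaranteed results
            out.push_back(k + h);
        k += U;
    }
    // Tail: fewer than U candidates remain, or the U-th one starts after r.end;
    // finish with per-interval comparisons exactly as in FS.
    for (; k < S.size() && S[k].start <= r.end; ++k)
        out.push_back(k);
}
```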

Figure 4 illustrates the functionality and the effect of the enhanced loop unrolling. Fix a current interval r from collection R, which overlaps with 8 intervals from S. The forward scan of \(\mathsf {FS}\) accesses 9 \(s'\) intervals, conducting 9 endpoint comparisons for the \(r.\mathsf {end} \ge s'.\mathsf {start}\) condition. The last comparison is needed to terminate the forward scan, i.e., it determines the first \(s'\) interval that starts after \(r.\mathsf {end}\). On the other hand, \(\mathsf {uFS}\) with an unrolling factor of 4 requires only 4 endpoint comparisons, in total. Specifically, the \(r.\mathsf {end} \ge s'.\mathsf {start}\) condition is initially checked for the fourth interval in S; since the condition holds, all first 4 \(s'\) intervals overlap the current r. The next 4 \(s'\) intervals are examined in the same manner. Last, \(\mathsf {uFS}\) checks the \(r.\mathsf {end} \ge s'.\mathsf {start}\) condition for the twelfth \(s'\) interval. As \(r.\mathsf {end} < s'.\mathsf {start}\), the twelfth interval from S does not overlap r, which means that \(\mathsf {uFS}\) will complete the forward scan similarly to \(\mathsf {FS}\), conducting an extra endpoint comparison for the ninth interval.

Fig. 4: Enhanced loop unrolling: forward scans. Endpoint comparisons are colored in dark gray; direct results output with no comparisons are in light gray

Fig. 5: Decomposed data layout: sweeping and scans

4.4 Decomposed data layout

We can further enhance \(\mathsf {FS}\) by carefully storing the input intervals in main memory. To demonstrate our intuition, consider again Algorithm 2 and the pseudo-code of \(\mathsf {FS}\). The algorithm essentially performs two operations: it advances the sweep line and forwardly scans the collections. We observe that neither of these operations considers every attribute of the input intervals. Specifically, in order to advance the sweep line, the \(\mathsf {start}\) endpoints of the current intervals \(r \in R\) and \(s \in S\) are compared, while the \(\mathsf {end}\) endpoints are of no use. Concerning forward scanning, assume without loss of generality that the current (fixed) interval is \(r \in R\) and so, \(\mathsf {FS}\) will next scan collection S (the case of forwardly scanning R is symmetric). Essentially, the algorithm needs only the \(\mathsf {end}\) endpoint of the current interval r and the \(\mathsf {start}\) endpoint of every scanned interval \(s'\) from S, in order to check the \(r.\mathsf {end} \ge s'.\mathsf {start}\) condition in Line 7. On the other hand, both \(r.\mathsf {start}\) and \(s'.\mathsf {end}\) for every examined \(s'\) are of no use to the forward scan operation.

Based on this observation, our last technique is inspired by the Decomposition Storage Model (DSM) [12], adopted by column-oriented database systems (e.g., [35]). Instead of storing an input collection as an array of \(\langle \mathsf {start},\mathsf {end}\rangle \) tuples, we decompose it into two separate arrays: one holding the \(\mathsf {start}\) endpoints and one the \(\mathsf {end}\) endpoints. With this decomposition, the algorithm can iterate only over the \(\mathsf {start}\) arrays when advancing the sweep line or forward scanning, which results in a smaller footprint in main memory and reduces the number of cache misses. We denote by \(\mathsf {dFS}\) the extension of \(\mathsf {FS}\) that employs our decomposed data layout. Figure 5 illustrates our decomposed data layout for \(\mathsf {dFS}\) compared to the data layout for \(\mathsf {FS}\).
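A minimal C++ sketch of the decomposed layout and of a forward scan over it is given below; the struct and function names are hypothetical and only illustrate how the \(\mathsf {end}\) array stays untouched during the scan.

```cpp
#include <cstdint>
#include <vector>

// Decomposed layout: two parallel arrays instead of an array of
// (start, end) tuples; position i describes the i-th interval.
struct DecomposedCollection {
    std::vector<uint64_t> start;   // sorted; drives sweeping and forward scans
    std::vector<uint64_t> end;     // accessed only for the probing interval
};

// Forward scan for the i-th interval of R against S, starting at position j.
void decomposedForwardScan(const DecomposedCollection& R, size_t i,
                           const DecomposedCollection& S, size_t j,
                           std::vector<size_t>& out) {
    const uint64_t rEnd = R.end[i];   // the single access to an end array
    for (size_t k = j; k < S.start.size() && S.start[k] <= rEnd; ++k)
        out.push_back(k);             // S.end is never touched by the scan
}
```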

4.5 Employing all optimizations

We finally discuss how all proposed optimization techniques can be put together in \(\mathsf {FS}\). Grouping and bucket indexing optimize \(\mathsf {FS}\) in an orthogonal manner; hence, it is possible to pair the optimizations, resulting in what we call \(\mathsf {FS}\) with grouping and bucket indexing (\(\mathsf {bgFS}\)). Figure 2g, h exemplifies \(\mathsf {bgFS}\) for intervals \(r_2\) and \(r_1\) (sorted by \(\mathsf {end}\) endpoints) and their group \(G^R\). Compared to \(\mathsf {bFS}\), the algorithm iterates through the same buckets regarding \(r_2\)’s scanned area, but produces join results for both \(r_2\) and \(r_1\) at the same time, similarly to \(\mathsf {gFS}\). Regarding \(r_1\)’s scanned area, \(\mathsf {bgFS}\) operates exactly as \(\mathsf {gFS}\) since the area is covered by a single bucket.

Essentially, the pseudo-code of \(\mathsf {bgFS}\) would resemble Algorithm 4 of \(\mathsf {bFS}\), except that groups must be formed and the forward scans are performed per group. Similar to \(\mathsf {gFS}\) and Lines 6–7 in Algorithm 3, \(\mathsf {bgFS}\) groups together consecutive intervals from the same input and reorders the contents of each group by their increasing \(\mathsf {end}\) endpoint. Then, Lines 9–11 and 12–14 are adjusted according to Lines 9–13 in Algorithm 3 of \(\mathsf {gFS}\), where a forward scan is performed for an entire group instead of a single interval. The case of grouping on collection S is symmetric.

The performance of \(\mathsf {bgFS}\) can be further improved by the enhanced loop unrolling and by adopting the decomposed data layout. Plugging enhanced loop unrolling into \(\mathsf {bgFS}\) is straightforward, and so is pairing our decomposed data layout with bucket indexing. Grouping can be enhanced by carefully decomposing the group data. Without loss of generality, consider \(\mathsf {gFS}\); the same approach can be applied for \(\mathsf {bgFS}\) and \(\mathsf {bguFS}\). Similarly to \(\mathsf {FS}\), we observe that forward scans on collection S (Lines 9–13 of Algorithm 3) take into account only the \(\mathsf {end}\) endpoint of each interval in group \(G^R\) (the case of forward scanning R is symmetric). In fact, the \(\mathsf {start}\) endpoint of the R intervals is used only to form the group (Line 6) before the forward scan commences. Hence, we can model every group as two arrays. Figure 5 illustrates this idea. Originally, all \(\mathsf {gFS}\) operations are conducted under the original layout, where both the input collections and the created groups are stored in arrays of \(\langle \mathsf {start},\mathsf {end}\rangle \) tuples. In contrast, by employing our decomposed layout, advancing the sweep line and the forward scan operations use only the \(\mathsf {start}\) arrays, whereas group scans (i.e., the for loops in Lines 9 and 19) operate on the \(\mathsf {end}\) arrays.

In Sect. 6.2, we experimentally study the effect of each of the four proposed optimization techniques. We also provide insights on how we can decide which of them should be activated depending on the characteristics of the input collections. To this end, we devise the \(\mathsf {optFS}\) method in Sect. 6.3.

Table 1 Characteristics of experimental datasets

5 The case of self-joins

Up to this point, we investigated only the case where the intervals from two distinct collections are joined. In this section, we discuss the case of a self-join, which receives a single collection R as input and looks for the pairs of intervals \((r_i,r_j) \in R \times R\) that overlap. All interval join algorithms discussed so far can be directly applied to solve this problem if we set the second input \(S=R\). However, such an approach requires a duplicate elimination post-processing step (or an extra comparison for each computed pair); otherwise, every \((r_i, r_j)\) pair with \(i \ne j\) would be reported twice, increasing the total number of results to \(2\cdot |R\bowtie R| - |R|\). Consider, for example, the collection \(R = \{r_1[3,5], r_2[4,6], r_3[7,11]\}\). The result of the \(R\bowtie R\) self-join contains pairs \((r_1,r_1)\), \((r_1,r_2)\), \((r_2,r_2)\) and \((r_3,r_3)\). Now, assume we use \(\mathsf {FS}\) from Algorithm 2 to compute this join by setting \(S = R\). The sweep line will first stop at \(r_1\); the forward scan on S will start from \(s_1\) and output \((r_1,s_1)\) and \((r_1,s_2)\), which correspond to \((r_1,r_1)\) and \((r_1,r_2)\). The next interval will be \(s_1\); the forward scan will start from the current interval from R, which was set to \(r_2\) at the end of the first forward scan, and hence, output \((s_1,r_2)\) (i.e., \((r_1,r_2)\)) for a second time.

Algorithm 5: \(\mathsf {FS}\) for self-joins

To address this issue, we design a simplified version of \(\mathsf {FS}\) which pairs an interval r only with itself and with the intervals from the collection that come after r in the sort order. Algorithm 5 illustrates the pseudo-code for the self-join version of \(\mathsf {FS}\). Going back to the previous example, the forward scan for \(r_1\) will produce \((r_1,r_2)\), but the forward scan for \(r_2\) will start from \(r_3\) and will thus avoid duplicate results.
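A minimal C++ sketch of this self-join variant is shown below, assuming hypothetical names and reporting results as index pairs over the sorted collection; every interval is paired with itself and only with intervals that follow it, so no duplicates arise.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

struct Interval { uint64_t start, end; };

// Self-join forward scan: pair each interval with itself and with every
// later interval (in start order) that begins before the current end.
void selfForwardScanJoin(std::vector<Interval> R,
                         std::vector<std::pair<size_t, size_t>>& out) {
    std::sort(R.begin(), R.end(),
              [](const Interval& a, const Interval& b) { return a.start < b.start; });
    for (size_t i = 0; i < R.size(); ++i) {
        out.emplace_back(i, i);                    // self pair
        for (size_t k = i + 1; k < R.size() && R[k].start <= R[i].end; ++k)
            out.emplace_back(i, k);                // forward scan, forward only
    }
}
```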

All our proposed optimizations can be applied on the self-join \(\mathsf {FS}\). The case of bucket indexing is straightforward; in practice, only one bucket index is defined and Algorithm 5 is extended analogously to Algorithm 4. Enhanced loop unrolling and the decomposed data layout for self-joins operate exactly as discussed in Sects. 4.3 and 4.4, respectively. On the other hand, we reconsider our grouping optimization, as all intervals are essentially consecutive from the same input. The solution is to group together intervals with exactly the same \(\mathsf {start}\) endpoint. Last, special care is taken for the group scan of \(\mathsf {gFS}\) (i.e., corresponding to the for loops in Lines 9 and 19 of Algorithm 3). Specifically, to avoid duplicate results, the i-th interval of a group G is paired with itself and with the \(|G|-i\) intervals that come after it in G, in the sort order. Note that these results can be reported while constructing the group.

6 Experiments on single-threaded processing

We next present the first part of our experimental analysis on the single-threaded computation of interval joins.

6.1 Setup

Our single-threaded analysis was conducted on a machine with 384 GBs of RAM and a dual Intel(R) Xeon(R) CPU E5-2630 v4 clocked at 2.20GHz, running CentOS Linux 7.3.1611. All methods were implemented in C++ and compiled using gcc (v4.8.5) with flags -O3, -mavx and -march=native. We imported in our source code the implementations of \(\mathsf {EBI}\)/\(\mathsf {LEBI}\) [29], \(\mathsf {OIP}\) [13] and \(\mathsf {DIP}\) [8], kindly provided by the authors of the corresponding papers. The setup of our benchmark is similar to that of [29]; every interval contains two 64-bit endpoint attributes (i.e., \(\mathsf {start}\) and \(\mathsf {end}\)), while the workload accumulates the sum of an XOR between the \(\mathsf {start}\) attributes of every result pair. Note that all data (input collections, index structures etc.) reside in main memory.

Datasets We experimented with 6 real datasets, the majority of which was used in recent literature on interval joins; Table 1 details the characteristics of the datasets. BOOKS [5] records all transactions at Aarhus public libraries in 2013 (https://www.odaa.dk); valid times indicate the periods when a book is lent out. FLIGHTS [6] records domestic flights in USA during January 2016 (https://www.bts.gov); valid times indicate the duration of a flight. GREEND [8, 25] records power usage data from households in Austria and Italy from January 2010 to October 2014; valid times indicate the period of a measurement. INFECTIOUS [8, 19] stores visiting information from the “INFECTIOUS: stay Away!” exhibition at Science Gallery in Dublin, Ireland, from May to July 2009; valid times indicate when a contact between visitors occurred. TAXIS records taxi trips (pick-up, drop-off timestamp) from New York City (https://www1.nyc.gov/site/tlc/index.page) in 2013; valid times indicate the duration of each ride. WEBKIT [5, 6, 13] records the file history in the git repository of the Webkit project from 2001 to 2016 (https://webkit.org); valid times indicate the periods when a file did not change.

Fig. 6: Selectivity of the tested join queries

Table 2 Tuning bucket indexing: \(\mathsf {bFS}\) execution time [secs] for \(|R|=|S|\); lowest time in bold

Queries We ran a series of interval join queries using uniformly sampled subsets of each dataset as the outer input R and the entire dataset as the inner S; for each setting, the |R|/|S| ratio varies inside \(\{0.25, 0.5, 0.75, 1\}\). To assess the performance of the evaluation methods, we measured their total execution time, which includes sorting, indexing and partitioning costs (wherever applicable).

Figure 6 reports on the selectivity of our tested join queries; for each dataset and |R|/|S| value, the figure plots how many intervals overlap with an input interval, on average. Under this, our datasets can be essentially divided into 3 categories. Joins on GREEND and INFECTIOUS are highly selective as every interval overlaps with at most 10 others, on average. In contrast, the result sets on WEBKIT and BOOKS queries include over 10,000 pairs for each input interval, on average. Queries on FLIGHTS and TAXIS lie in the middle, but they are significantly less selective than the GREEND and INFECTIOUS joins.

Tuning To tune our bucket indexing optimization, we ran a test for the \(|R|=|S|\) setting which monitored the execution time of \(\mathsf {bFS}\) while varying the number B of buckets or, equivalently, the number of domain stripes used. Table 2 reports on the results of this test; the lowest execution time for each dataset is highlighted in bold. We draw two important findings. First, bucket indexing is not effective on GREEND and INFECTIOUS; the lowest execution time was observed for \(B = 1\), i.e., when \(\mathsf {bFS}\) operates exactly as \(\mathsf {FS}\). We elaborate on this issue in the next section. On the other hand, increasing the number of buckets accelerates \(\mathsf {bFS}\) for BOOKS, FLIGHTS, TAXIS and WEBKIT joins. The best B value for all four datasets lies between 10,000 and 1,000,000; further increasing B eventually slows down \(\mathsf {bFS}\) because the domain is fragmented into too many stripes. Under this, we set the number of buckets for the rest of this article to 100,000. Last, we set the loop unrolling factor to 32, similar to previous work in [29], such that every loop iteration can be processed as high as possible in the cache hierarchy.

Fig. 7: Optimizing \(\mathsf {FS}\): execution time

6.2 Optimizing \(\mathsf {FS}\)

We first study the effectiveness of our optimization techniques for \(\mathsf {FS}\), i.e., grouping, bucket indexing, enhanced loop unrolling and decomposed data layout, captured by methods \(\mathsf {gFS}\), \(\mathsf {bFS}\), \(\mathsf {uFS}\) and \(\mathsf {dFS}\), respectively. Figure 7 reports the execution time of the methods. To save space, we do not include a breakdown of the execution times. Nevertheless, the findings are similar to the case of one partition in Figures 11 and 13, i.e., for highly selective queries, sorting dominates the total computation cost.

Table 3 Grouping: extent of forward scan per input interval
Table 4 Grouping: average group size

Grouping We observe that the grouping optimization is effective in 4 out of our 6 experimental datasets. In fact, the execution times in Fig. 7 align with the join selectivities in Fig. 6. For the highly selective queries in GREEND and INFECTIOUS, \(\mathsf {gFS}\) is slower than \(\mathsf {FS}\). As these datasets contain very short intervals (see Table 1), a forward scan by \(\mathsf {FS}\) examines only a few intervals (10 or fewer on average, according to Fig. 6); recall that the forward scan for an interval, e.g., \(r \in R\), extends from the first interval in S which starts after \(r.\mathsf {start}\) until the first interval in S which starts after \(r.\mathsf {end}\). As a result, any reduction in the average extent of the forward scan achieved by \(\mathsf {gFS}\) does not pay off in practice. Table 3 reports on the forward scan extent per interval by \(\mathsf {FS}\) and \(\mathsf {gFS}\). Grouping induces a clear relative reduction of this extent for INFECTIOUS (approximately one order of magnitude), but in absolute numbers the forward scans were very short and thus cheap in the first place. An additional indicator for the ineffectiveness of grouping is the size of the created groups, reported in Table 4. Notice that for GREEND queries, groups contain fewer than two intervals on average; hence, \(\mathsf {gFS}\) does not provide any benefit over \(\mathsf {FS}\).

On the other hand, \(\mathsf {gFS}\) outperforms \(\mathsf {FS}\) by a wide margin (up to one order of magnitude) for BOOKS, WEBKIT, FLIGHTS and TAXIS, where the join queries return a large number of results. As the intervals in these datasets are significantly longer compared to GREEND and INFECTIOUS, a forward scan by \(\mathsf {FS}\) examines a large number of intervals and consequently conducts a large number of endpoint comparisons. In this context, grouping consecutive intervals from the same input and performing a single forward scan for the entire group enables \(\mathsf {gFS}\) to produce result pairs in batch and avoid redundant comparisons. In fact, the performance gain of \(\mathsf {gFS}\) over \(\mathsf {FS}\) grows with |R|/|S|, as the extent of the forward scans increases and the join queries become computationally harder. Last, we observe that the effectiveness of grouping also increases with the size of the created groups; notice how much \(\mathsf {gFS}\) outperforms \(\mathsf {FS}\) in BOOKS, where each group contains several hundred intervals.

Table 5 Bucket indexing: percentage of the join results produced without endpoint comparisons.

Bucket indexing Similar to grouping, the effectiveness of the bucket indexing optimization depends on the extent of the forward scans. Recall from Sect. 4.2 that \(\mathsf {bFS}\) performs the forward scans as range queries over the domain stripes; buckets for stripes entirely contained inside the forward scan areas provide direct join results, i.e., without the need for additional endpoint comparisons. The longer the forward scans are, the more stripes are entirely covered and hence, the more redundant comparisons are avoided. Under this, \(\mathsf {bFS}\) outperforms \(\mathsf {FS}\) for all |R|/|S| values on BOOKS, FLIGHTS, TAXIS and WEBKIT queries, while \(\mathsf {FS}\) is faster than \(\mathsf {bFS}\) for GREEND and INFECTIOUS, where forward scans are very short. Table 5 reports the ratio of the result pairs that \(\mathsf {bFS}\) outputs without conducting any comparisons. For joins on GREEND and INFECTIOUS, \(\mathsf {bFS}\) essentially operates similarly to \(\mathsf {FS}\), but with the extra cost of creating and querying the bucket indices. In contrast, for the rest of the datasets, \(\mathsf {bFS}\) outputs from 48% to over 70% of the result pairs without any endpoint comparisons.

Table 6 Enhanced loop unrolling: percentage of the join results produced without endpoint comparisons.

Enhanced loop unrolling Among all four proposed optimizations, the enhanced loop unrolling is the most robust. As Fig. 7 shows, the technique is very effective when forward scans are long, i.e., for all queries in BOOKS, FLIGHTS, TAXIS and WEBKIT, while for highly selective joins with short scans, i.e., in GREEND, INFECTIOUS, it is less effective but almost never slows down the computation. The ratio of the result pairs which \(\mathsf {uFS}\) outputs without any endpoint comparisons supports this finding (see Table 6); note that even on the highly selective joins in GREEND and INFECTIOUS, \(\mathsf {uFS}\) directly outputs 50% or more of the results.

Decomposed data layout Last, our decomposed data layout exhibits similar behavior to grouping and bucket indexing. Essentially, long forward scans incur a large main memory footprint and hence, scanning a dedicated array of \(\mathsf {start}\) endpoints, which is smaller in bytes, can significantly reduce the cache misses. Under this, queries on BOOKS and WEBKIT benefit the most from applying \(\mathsf {dFS}\). In contrast, for GREEND and INFECTIOUS the extra cost of the decomposition does not pay off, as the data for the forward scans are already small enough to be handled in the highest levels of the cache.

Discussion Figure 7 also reports the execution time of \(\mathsf {bgudFS}\), which employs all four optimizations at the same time. We observe that on BOOKS, FLIGHTS, TAXIS and WEBKIT queries, \(\mathsf {bgudFS}\) clearly outperforms \(\mathsf {FS}\) and all its variants that employ a single optimization; this is expected, as the proposed techniques optimize \(\mathsf {FS}\) in an orthogonal manner and so, can be effectively combined. Note that the performance gain of \(\mathsf {bgudFS}\) over the rest of the methods actually grows with |R|/|S|. On the other hand, for GREEND and INFECTIOUS queries, the method inherits the shortcomings of grouping, bucket indexing and the decomposed data layout, which renders \(\mathsf {bgudFS}\) the slowest method.

Our analysis on optimizing \(\mathsf {FS}\) draws two key conclusions. First, the enhanced loop unrolling, which builds upon a code transformation, should always be applied; \(\mathsf {uFS}\) outperformed \(\mathsf {FS}\) in almost all our test queries. Second, the less selective and hence, more computationally expensive an interval join is, the more effective grouping, bucket indexing and decomposed data layout will be. Under these observations, the most efficient \(\mathsf {FS}\) variant is either \(\mathsf {bgudFS}\) or \(\mathsf {uFS}\), depending on the selectivity of the interval join.

6.3 \(\mathsf {optFS}\): a self-tuning \(\mathsf {FS}\)

To deal with this decision problem, we devised the \(\mathsf {optFS}\) method, which operates in two phases. In the first phase, \(\mathsf {optFS}\) roughly estimates the average cost of a forward scan; we rely on sampling and on executing \(\mathsf {uFS}\) for this purpose. In brief, we uniformly divide the domain into a predefined number of ranges (equal to 50) and let \(\mathsf {uFS}\) run on a sample from both inputs (equal to \(1\permille \)) inside every range; practically, a simplified and very fast version of \(\mathsf {uFS}\), which only counts the extent of the conducted forward scans, is executed. This sampling-based process manages to approximate the real value of the average forward scan extent with an \(18\%\) relative error, on average. Although we could improve the accuracy by increasing the number of ranges into which we divide the domain and/or the sampling ratio, our goal is different. We are interested only in estimating the order of magnitude of the forward scan extent; in this context, the discussed sampling-based process achieves almost 100% accuracy. Our tests have shown that when forward scans cover only some tens (or a hundred in the worst case) of intervals on average, then grouping, bucket indexing and the decomposed data layout will not pay off; this is the case for the GREEND and INFECTIOUS queries. Based on this observation, \(\mathsf {optFS}\) decides whether to run \(\mathsf {uFS}\) or \(\mathsf {bgudFS}\) in its second phase. Note that the cost of the first (sampling and decision) phase of \(\mathsf {optFS}\) is negligible compared to the cost of the second phase (joining); in our tests, sampling and decision making took only \(3\permille \) of the total execution time of \(\mathsf {optFS}\), on average.
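The following C++ sketch conveys the flavor of this decision step under simplifying assumptions: it estimates the average forward-scan extent with binary searches over a uniform sample of R (names, sampling step and threshold are hypothetical), whereas \(\mathsf {optFS}\) itself draws samples per domain range and runs a counting-only \(\mathsf {uFS}\).

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Interval { uint64_t start, end; };

// Returns true if the estimated average forward-scan extent suggests running
// bgudFS; otherwise uFS is preferred. Both inputs are sorted by start.
bool useAllOptimizations(const std::vector<Interval>& R,
                         const std::vector<Interval>& S,
                         size_t sampleStep = 1000,          // ~1 permille sample of R
                         double extentThreshold = 100.0) {  // "some tens" of intervals
    uint64_t total = 0, scans = 0;
    for (size_t i = 0; i < R.size(); i += sampleStep) {
        // Approximate extent: S-intervals starting within [R[i].start, R[i].end].
        auto lo = std::lower_bound(S.begin(), S.end(), R[i].start,
                                   [](const Interval& s, uint64_t v) { return s.start < v; });
        auto hi = std::upper_bound(S.begin(), S.end(), R[i].end,
                                   [](uint64_t v, const Interval& s) { return v < s.start; });
        total += static_cast<uint64_t>(hi - lo);
        ++scans;
    }
    double avgExtent = scans ? static_cast<double>(total) / scans : 0.0;
    return avgExtent > extentThreshold;   // long scans: bgudFS pays off
}
```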

Fig. 8: Comparisons: \(\mathsf {optFS}\) against competition

6.4 \(\mathsf {optFS}\) against the competition

After optimizing \(\mathsf {FS}\), we compare our \(\mathsf {optFS}\) against previous work, i.e., the partition-based methods \(\mathsf {DIP}\), \(\mathsf {OIP}\) and the state-of-the-art plane sweep method \(\mathsf {LEBI}\). For the competitor methods, we enforced traditional loop unrolling whenever possible. In addition, we included the \(\mathsf {bgFS}\) method from our previous publication [5]. Figure 8 reports the execution times; as expected, the time of all methods rises as the |R|/|S| ratio increases. Observe, however, that the plane sweep based methods \(\mathsf {LEBI}\), \(\mathsf {bgFS}\)-[5] and \(\mathsf {optFS}\) always outperform their partition-based competitors, in most cases by orders of magnitude, with the exception of GREEND queries where the performance of \(\mathsf {DIP}\) is very close to \(\mathsf {LEBI}\). This finding fully aligns with the analysis in [29], where \(\mathsf {LEBI}\) (and plane sweep based algorithms in general) was shown to outperform \(\mathsf {OIP}\).

For \(\mathsf {optFS}\) against \(\mathsf {LEBI}\), the tests clearly show that we achieved our original goal. Optimized \(\mathsf {FS}\) can be not only competitive with but also faster than the state-of-the-art \(\mathsf {LEBI}\) which, as discussed in Sect. 3.1, performs no endpoint comparisons to produce the results. Also, we made this possible without relying on a special data structure such as the gapless hash map. In fact, \(\mathsf {optFS}\) outperforms \(\mathsf {LEBI}\) in 16 of the 24 queries in Fig. 8. For the highly selective joins on GREEND and INFECTIOUS, \(\mathsf {optFS}\) (powered by \(\mathsf {uFS}\)) is faster by a 70–82% margin, while for the least selective joins on BOOKS and WEBKIT, \(\mathsf {optFS}\) (powered by \(\mathsf {bgudFS}\)) outperforms \(\mathsf {LEBI}\) by a 13–36% margin. \(\mathsf {LEBI}\) steadily outperforms \(\mathsf {optFS}\) only on FLIGHTS, by a 14–22% margin, while on TAXIS the two methods have similar performance.

In terms of memory consumption, our preliminary analysis in [5] showed that \(\mathsf {LEBI}\) always incurs a larger memory footprint than \(\mathsf {bgFS}\), due to the data replication in its endpoint indices and to maintaining open intervals inside two gapless hash maps. The same trend holds compared to \(\mathsf {optFS}\). As a code transformation, enhanced loop unrolling incurs no extra storage costs, while the decomposed data layout results in a 19% average increase over \(\mathsf {bgFS}\) when used, i.e., for the queries on BOOKS, FLIGHTS, TAXIS and WEBKIT.

In view of these results, our analysis in the rest of this article will primarily focus on \(\mathsf {optFS}\) as the most efficient single-threaded method for interval joins.

7 Parallel processing

We now shift our focus to the parallel processing of interval joins, which benefits from the existence of multiple CPU cores in a machine. We discuss three different solutions: (i) the case where no physical partitioning of the input collections is employed, (ii) the hash-based partitioning approach suggested in [29], and (iii) our domain-based partitioning approach. For the latter two approaches, we also discuss different strategies for efficiently partitioning the input intervals in Sect. 8.

7.1 No-partitioning parallel join

A straightforward approach to benefit from modern parallel hardware is to identify tasks of an interval join algorithm that are independent of each other and hence, can run in parallel. Every such task is assigned to a separate CPU core or thread. The input interval collections are never physically partitioned (hence the name of the approach), which means that the processing threads need to simultaneously traverse data structures stored in shared main memory. A similar approach was used in the past for relational equi-joins, e.g., in [4], where a hash table is built in shared memory for the inner input and then, every thread reads a chunk of the outer input and probes the shared hash table to produce join results.

Our experiments on single-threaded join computation clearly showed the advantage of plane sweep based evaluation and of \(\mathsf {optFS}\) in particular. In what follows, we discuss a no-partitioning parallel adaptation of \(\mathsf {FS}\) and its variants.Footnote 7 Recall from Sect. 3.2 that the algorithm essentially involves two tasks: (i) advancing a sweep line which stops at the \(\mathsf {start}\) endpoint of every input interval, and (ii) for each position of the sweep line, performing a forward scan to output join results. Despite traversing the same data structures, i.e., those containing the input collections, it is easy to confirm that the forward scans are independent of each other. Therefore, we design a parallel version of \(\mathsf {FS}\) which follows a master-slaves approach. We rely on a particular thread, which we call the master, to advance the sweep line, i.e., to execute Lines 4–5, 10–11 and 16 of Algorithm 2. When the sweep line stops, the master assigns the current forward scan to the next available thread (i.e., to a slave). Slave threads operate in a completely independent and asynchronous manner, executing instances of Lines 6–9 and 12–15 of Algorithm 2 in parallel. Note that all optimizations from Sect. 4 can be applied to parallel \(\mathsf {FS}\). Enhanced loop unrolling, decomposed data layout and bucket indexing are straightforward; for the latter, every slave thread practically executes Lines 7–14 and 17–24 of Algorithm 4. For the grouping optimization, the master thread has to additionally create the groups (Lines 6 and 16 of Algorithm 3), but every group is then assigned to a slave thread which first sorts the group intervals according to their \(\mathsf {end}\) endpoint and then performs the forward scan; in other words, a slave thread executes an instance of Lines 7–13 and 17–23 of Algorithm 3, receiving a group of intervals as input.
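The following C++/OpenMP sketch outlines this master-slaves task decomposition; it only counts results and omits the optimizations of Sect. 4, so it should be read as an illustration under these simplifying assumptions rather than as our actual implementation.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>
#include <omp.h>

struct Interval { long start, end; };

// Forward scan for one interval r over the other (start-sorted) input from
// position pos; results are only counted to keep the sketch short.
static long forward_scan(const Interval& r, const std::vector<Interval>& other,
                         std::size_t pos) {
  long results = 0;
  while (pos < other.size() && other[pos].start <= r.end) { ++results; ++pos; }
  return results;
}

// No-partitioning (master-slaves) parallel FS: R and S are assumed sorted by
// start. One thread (the master) advances the sweep line; every forward scan
// is spawned as an independent OpenMP task executed by the slave threads.
long parallel_fs_count(const std::vector<Interval>& R, const std::vector<Interval>& S) {
  std::atomic<long> total{0};
  #pragma omp parallel
  #pragma omp single                              // the master thread only sweeps
  {
    std::size_t i = 0, j = 0;
    while (i < R.size() && j < S.size()) {
      if (R[i].start <= S[j].start) {
        const std::size_t ri = i++, pos = j;
        #pragma omp task firstprivate(ri, pos)    // slave: scan S forward from pos
        total += forward_scan(R[ri], S, pos);
      } else {
        const std::size_t si = j++, pos = i;
        #pragma omp task firstprivate(si, pos)    // slave: scan R forward from pos
        total += forward_scan(S[si], R, pos);
      }
    }
  }
  return total.load();
}
```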

Paradigm 1 (pseudo-code of the hash-based partitioning paradigm)

7.2 Hash-based partitioning

In [29], Piatov et al. proposed a hash-based partitioning paradigm for parallelizing \(\mathsf {EBI}\) (and its lazy \(\mathsf {LEBI}\) version), described by Paradigm 1. The evaluation of the join involves two phases. First, the input collections are split into k disjoint partitions using the same hash function h. During the second phase, a pairwise join is performed between all \(\{R_1,\ldots ,R_k\}\) partitions of collection R and all \(\{S_1,\ldots ,S_k\}\) of S; in practice, any single-threaded interval join algorithm can be employed to join two partitions. Since the partitions are disjoint, the pairwise joins run independently of each other.

In [29], the intervals in the input collections are sorted by their \(\mathsf {start}\) endpoint before partitioning, and then assigned to partitions in a round-robin fashion, i.e., the i-th interval is assigned to partition \(h(i) = i \bmod k\). This causes the active tuple sets \(A^R\), \(A^S\) at each instance of the \(\mathsf {EBI}\) join to become small, because neighboring intervals are assigned to different partitions. As the cardinality of \(A^R\), \(A^S\) impacts the run time of \(\mathsf {EBI}\), each join in Line 9 is cheap. On the other hand, the intervals in each partition span the entire domain, meaning that the data in each partition are much sparser compared to the entire dataset. This causes Paradigm 1 to conduct an increasing number of endpoint comparisons compared to a single-threaded algorithm, as k increases. In particular, recall that the basic cost of \(\mathsf {FS}\) and \(\mathsf {EBI}\) is the sweeping of the whole space, incurring \(|R|+|S|\) and \(2\cdot (|R|+|S|)\) comparisons, respectively. Under hash-based partitioning, \(k^2\) joins are executed in parallel, and each partition carries \(|R|/k+|S|/k\) intervals. Hence, the total basic cost becomes \(k\cdot (|R|+|S|)\) and \(2\cdot k\cdot (|R|+|S|)\), respectively (i.e., an increase by a factor of k).
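A minimal C++/OpenMP sketch of the hash-based paradigm follows; the inner join is a naive nested-loop placeholder for whichever single-threaded algorithm is plugged in (e.g., \(\mathsf {FS}\) or \(\mathsf {LEBI}\)), so the code illustrates the partitioning and the \(k^2\) independent pairwise joins, not the join algorithm itself.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Interval { long start, end; };

// Placeholder for any single-threaded interval join (e.g., FS or LEBI); a naive
// nested-loop counter is used here only to keep the sketch self-contained.
static long join_partitions(const std::vector<Interval>& R, const std::vector<Interval>& S) {
  long results = 0;
  for (const Interval& r : R)
    for (const Interval& s : S)
      if (r.start <= s.end && s.start <= r.end) ++results;
  return results;
}

// Hash-based partitioning paradigm: sort both inputs by start, assign the i-th
// interval to partition i mod k (round-robin) and evaluate all k*k pairwise
// partition-joins; the joins are independent because the partitions are disjoint.
long hash_partitioned_join(std::vector<Interval> R, std::vector<Interval> S, int k) {
  auto by_start = [](const Interval& a, const Interval& b) { return a.start < b.start; };
  std::sort(R.begin(), R.end(), by_start);
  std::sort(S.begin(), S.end(), by_start);

  std::vector<std::vector<Interval>> Rp(k), Sp(k);
  for (std::size_t i = 0; i < R.size(); ++i) Rp[i % k].push_back(R[i]);  // h(i) = i mod k
  for (std::size_t i = 0; i < S.size(); ++i) Sp[i % k].push_back(S[i]);

  long total = 0;
  #pragma omp parallel for collapse(2) reduction(+: total)
  for (int a = 0; a < k; ++a)
    for (int b = 0; b < k; ++b)
      total += join_partitions(Rp[a], Sp[b]);
  return total;
}
```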

7.3 Domain-based partitioning

Similar to Paradigm 1, our domain-based partitioning paradigm for parallel interval joins (Paradigm 2) involves two phases. The first phase (Lines 1–13) splits the domain uniformly into k non-overlapping stripes; a partition \(R_j\) (resp. \(S_j\)) is created for each domain stripe \(t_j\). Let \(t_{\mathsf {start}}\), \(t_{\mathsf {end}}\) denote the stripes that cover \(r.\mathsf {start}\), \(r.\mathsf {end}\) of an interval \(r \in R\), respectively. Interval r is first assigned to partition \(R_{\mathsf {start}}\) created for stripe \(t_{\mathsf {start}}\). Then, r is replicated across stripes \(t_{\mathsf {start}+1}\ldots t_{\mathsf {end}}\). During the second phase (Lines 15–16), the domain-based paradigm computes \(R_j \bowtie S_j\) for every domain stripe \(t_j\), independently. To avoid producing duplicate results, a join result (r, s) is reported only if at least one of the involved intervals is not a replica. We can easily prove that if the \(\mathsf {start}\) endpoint of neither r nor s is in \(t_j\), then r and s also intersect in the previous stripe \(t_{j-1}\), and therefore (r, s) will be reported by another partition-join.
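The partitioning phase and the duplicate-avoidance test can be sketched in C++ as follows; the uniform stripe width and the replica flag are as described above, while the per-stripe joins themselves are omitted, so this is only an illustrative sketch of the assignment and replication logic.

```cpp
#include <vector>

struct Interval { long start, end; };
struct Rec { Interval i; bool replica; };   // replica: the start lies in an earlier stripe

// First phase of the domain-based paradigm: k uniform stripes over [dom_min, dom_max];
// each interval is assigned to the stripe covering its start and replicated (flagged)
// to every later stripe it overlaps.
std::vector<std::vector<Rec>> domain_partition(const std::vector<Interval>& in,
                                               long dom_min, long dom_max, int k) {
  std::vector<std::vector<Rec>> parts(k);
  const long width = (dom_max - dom_min) / k + 1;            // uniform stripe width
  for (const Interval& x : in) {
    const int first = static_cast<int>((x.start - dom_min) / width);
    const int last  = static_cast<int>((x.end   - dom_min) / width);
    parts[first].push_back({x, false});                       // original copy
    for (int j = first + 1; j <= last && j < k; ++j)
      parts[j].push_back({x, true});                          // replicas
  }
  return parts;
}

// During the per-stripe join of the second phase, a pair (r, s) is reported only
// if at least one side is not a replica; this suppresses duplicate results.
inline bool report_pair(const Rec& r, const Rec& s) { return !(r.replica && s.replica); }
```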

Fig. 9
figure 9

Domain-based partitioning of the intervals in Fig. 2; the case of 4 domain stripes \(t_1\ldots t_4\)

We show the difference between the two paradigms using Fig. 2; without loss of generality, assume that we allocate 4 CPU threads for computing \(R\bowtie S\). To fully take advantage of parallelism, we assign each partition-join to a separate thread. Hence, the hash-based paradigm will first create \(\sqrt{4} = 2\) partitions for each input, i.e., \(R_1=\{r_1\}\), \(R_2=\{r_2\}\) for collection R and \(S_1=\{s_1,s_3,s_5\}\), \(S_2=\{s_2,s_4\}\) for S, and then evaluate the pairwise joins \(R_1\bowtie S_1\), \(R_1\bowtie S_2\), \(R_2\bowtie S_1\) and \(R_2\bowtie S_2\). In contrast, the domain-based paradigm will first split the domain into the 4 disjoint stripes pictured in Fig. 9, and then assign and replicate (if needed) the intervals into 4 partitions for each collection: \(R_1 = \{r_{1}\}\), \(R_2 = \{\hat{r}_{1},r_{2}\}\), \(R_3 = \{\hat{r}_{1},\hat{r}_{2}\}\), \(R_4 = \{\hat{r}_{1}\}\) for R and \(S_1 = \{s_{1}\}\), \(S_2 = \{s_{2},s_{3}\}\), \(S_3 = \{\hat{s}_{3}\}\), \(S_4 = \{\hat{s}_{3}, s_4,s_5\}\) for S, where \(\hat{r}_{i}\) (resp. \(\hat{s}_{i}\)) denotes a replica of interval \(r_i \in R\) (resp. \(s_i \in S\)) inside a stripe. Last, the paradigm will compute the partition-joins \(R_1\bowtie S_1\), \(R_2\bowtie S_2\), \(R_3\bowtie S_3\) and \(R_4\bowtie S_4\). Note that \(R_3\bowtie S_3\) will produce no results because all contents of \(R_3\) and \(S_3\) are replicas, while \(R_4\bowtie S_4\) will only produce \((r_1,s_4)\) but not \((r_1,s_3)\), which will be found in \(R_2\bowtie S_2\).

Our domain-based partitioning paradigm achieves a higher degree of parallelism compared to Paradigm 1, because for the same number of partitions it requires quadratically fewer joins. Also, as opposed to previous work that also applies domain-based partitioning (e.g., [9, 24]), we avoid the production and elimination of duplicate join results. On the other hand, long-lived intervals that span a large number of stripes and skewed distributions of \(\mathsf {start}\) endpoints create joins of imbalanced costs. In what follows, we propose two orthogonal techniques that deal with load balancing.

Paradigm 2 (pseudo-code of the domain-based partitioning paradigm)

7.3.1 Mini-joins and Greedy scheduling

Our first optimization of Paradigm 2 is based on decomposing the partition-join \(R_j \bowtie S_j\) for a domain stripe \(t_j\) into a number of mini-joins. The mini-joins can be executed independently (i.e., by a different thread) and bear different costs. Hence, they form tasks that can be greedily scheduled based on their cost estimates, in order to achieve load balancing.

Specifically, consider stripe \(t_j\) and let \(t_j.\mathsf {start}\) and \(t_j.\mathsf {end}\) be its endpoints. We distinguish between the following cases for an interval \(r \in R\) (resp. \(s \in S\)) which is in partition \(R_j\) (resp. \(S_j\)):

  1. (A)

    r starts inside \(t_j\), i.e., \(t_j.\mathsf {start} \le r.\mathsf {start} < t_j.\mathsf {end}\),

  2. (B)

    r starts inside a previous stripe but ends inside \(t_j\), i.e., \(r.\mathsf {start} < t_j.\mathsf {start}\) and \(r.\mathsf {end} < t_j.\mathsf {end}\), or

  3. (C)

    r starts inside a previous stripe and ends after \(t_j\), i.e., \(r.\mathsf {start} < t_j.\mathsf {start}\) and \(r.\mathsf {end} \ge t_j.\mathsf {end}\).

Note that in cases (B) and (C), r is assigned to partition \(R_j\) by replication (Lines 7–8 and 13–14 of Paradigm 2). We use \(R^{A}_j\), \(R^{B}_j\), and \(R^{C}_j\) (resp. \(S^{A}_j\), \(S^{B}_j\), and \(S^{C}_j\)) to denote the mini-partitions of \(R_j\) (resp. \(S_j\)) that correspond to the 3 cases above.
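The classification can be sketched with a few comparisons per interval; the following C++ snippet is illustrative only and assumes the interval already belongs to the partition of stripe \(t_j\).

```cpp
struct Interval { long start, end; };

enum class MiniClass { A, B, C };

// Classification of an interval of partition R_j (or S_j) into the three
// mini-partitions, given the stripe endpoints t_start and t_end.
MiniClass classify(const Interval& r, long t_start, long t_end) {
  if (r.start >= t_start) return MiniClass::A;   // (A) starts inside the stripe
  if (r.end < t_end)      return MiniClass::B;   // (B) replica ending inside the stripe
  return MiniClass::C;                           // (C) replica spanning the whole stripe
}
```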

Under this classification, we can break partition-join \(R_j \bowtie S_j\) down into 9 distinct mini-joins; only 5 of these 9 need to be evaluated, while the evaluation of 4 out of these 5 mini-joins is simplified. Specifically:

  • \(R_j^{A} \bowtie S_j^{A}\) is evaluated as normal, i.e., as discussed in Sections 3 and 4.

  • For \(R_j^{A} \bowtie S_j^{B}\) and \(R_j^{B} \bowtie S_j^{A}\), join algorithms only visit \(\mathsf {end}\) endpoints in \(S_j^{B}\) and \(R_j^{B}\), respectively; \(S_j^{B}\) and \(R_j^{B}\) only contain replicated intervals from previous stripes which are properly flagged to precede all intervals starting inside \(t_j\), and so, they form the sole group from \(S_j^{B}\) and \(R_j^{B}\) when the grouping optimization technique is used.

  • \(R_j^{A} \bowtie S_j^{C}\) and \(R_j^{C} \bowtie S_j^{A}\) reduce to cross-products, because replicas inside mini-partitions \(S_j^{C}\) and \(R_j^{C}\) span the entire stripe \(t_j\); hence, all interval pairs are directly output as results without any endpoint comparisons.

  • \(R_j^{B} \bowtie S_j^{B}\), \(R_j^{B} \bowtie S_j^{C}\), \(R_j^{C} \bowtie S_j^{B}\) and \(R_j^{C} \bowtie S_j^{C}\) are not executed at all, as the intervals from both inputs start in a previous stripe, and hence the results of these mini-joins would be duplicates.

Given a fixed number n of available CPU threads, i.e., a partitioning of the domain into \(k=n\) stripes, our goal is to assign each of the (in total) \(1+5\cdot (k-1)\) mini-joinsFootnote 8 to a thread, in order to evenly distribute the load among all threads, or else to minimize the maximum load per thread. This is a well known NP-hard problem, which we opt to solve using a classic \((4/3-1/(3n))\)-approximation algorithm [17] that performs very well in practice. The algorithm greedily assigns the next largest job to the CPU thread with the currently lowest load. In detail, we first estimate the cost of each mini-join; a straightforward approach for this is to consider the product of the cardinalities of the involved mini-partitions. Next, for each available thread p, we define its bag \(b_p\), which contains the mini-joins to be executed, and its load \(\ell _p\), computed by adding up the estimated costs of the mini-joins in \(b_p\); initially, \(b_p\) is empty and \(\ell _p = 0\). We organize the bags in a min-priority queue \(\mathcal {Q}\) based on their load. Last, we examine all mini-joins in descending order of their estimated cost. For each mini-join, say \(R_j^{A} \bowtie S_j^{A}\), we remove the bag \(b_p\) at the top of \(\mathcal {Q}\), corresponding to the thread p with the lowest load, append \(R_j^{A} \bowtie S_j^{A}\) to \(b_p\), update \(\ell _p\), and re-insert the bag into \(\mathcal {Q}\). This greedy scheduling algorithm terminates after all mini-joins have been appended to a bag.
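A compact C++ sketch of this greedy (largest-job-first) scheduling follows; it assumes the mini-join costs have already been estimated, e.g., as products of mini-partition cardinalities, and it only computes the assignment, not the joins themselves.

```cpp
#include <algorithm>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Greedy (largest-job-first) scheduling: mini-joins are considered in decreasing
// order of estimated cost and each one is appended to the bag of the thread with
// the currently lowest load, kept at the top of a min-priority queue.
struct MiniJoin { int id; long est_cost; };      // e.g., est_cost = |R_j^A| * |S_j^A|

std::vector<std::vector<int>> greedy_schedule(std::vector<MiniJoin> jobs, int n_threads) {
  std::sort(jobs.begin(), jobs.end(),
            [](const MiniJoin& a, const MiniJoin& b) { return a.est_cost > b.est_cost; });

  using Bag = std::pair<long, int>;              // (load l_p, thread id p)
  std::priority_queue<Bag, std::vector<Bag>, std::greater<Bag>> Q;   // min-heap on load
  for (int p = 0; p < n_threads; ++p) Q.push({0, p});

  std::vector<std::vector<int>> bags(n_threads); // mini-join ids assigned to each thread
  for (const MiniJoin& mj : jobs) {
    auto [load, p] = Q.top();                    // least-loaded thread
    Q.pop();
    bags[p].push_back(mj.id);
    Q.push({load + mj.est_cost, p});             // re-insert with the updated load
  }
  return bags;
}
```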

Discussion and implementation details In practice, the greedy scheduling algorithm replaces an atomic assignment approach (Lines 15–16 of Paradigm 2) that would schedule each partition-join as a whole to the same thread. The breakdown of each partition-join into mini-joins that can be executed by different CPU threads greatly improves load balancing when the original tasks have large cost differences.

7.3.2 Adaptive partitioning

Our second load balancing technique, adaptive partitioning, re-positions the borders between the \(\{t_1,\ldots ,t_k\}\) stripes, aiming to make the costs of all partition-joins on Line 16 of Paradigm 2 similar. Assuming a 1-1 assignment of partition-joins to CPU threads, load balancing can be achieved by finding the optimal k partitions that minimize the maximum partition-join cost. This can be modeled as the problem of defining a k-bin histogram with the minimum maximum error per bin.Footnote 9 The problem can be solved exactly in PTIME with respect to the domain size with the help of dynamic programming [20]; however, in our case the domain of the intervals is huge, so we resort to a heuristic that computes a good solution very fast. The time taken for partitioning should not dominate the cost of the join (otherwise, the purpose of a good partitioning is defeated). Our heuristic is reminiscent of local search heuristics for creating histograms over large domains, which have no quality guarantees but compute a good solution in practice within a short time [30]. Note that, in practice, the overall execution time is dominated by the most expensive partition-join. Hence, given as input an initial set of stripes and partitions (more details in the next paragraph), we perform the following steps. First, we identify the CPU thread, or equivalently the stripe \(t_j\), that carries the highest load. Then, we reduce \(t_j\)’s load (denoted as \(\ell _j\)) by moving consecutive intervals from \(R_j\) and \(S_j\) to the corresponding partitions of its neighbor stripe with the highest load, i.e., either \(t_{j-1}\) or \(t_{j+1}\), until \(\ell _{j-1} > \ell _j\) or \(\ell _{j+1} > \ell _j\) holds, respectively. Intuitively, this procedure corresponds to advancing endpoint \(t_j.\mathsf {start}\) or retreating \(t_j.\mathsf {end}\). Last, we repeatedly examine the thread with the highest load until no further load can be moved.

The implementation of this heuristic raises two important challenges: (i) how to quickly estimate the load on each of the \(n=k\) available CPU threads and (ii) what is the smallest unit of load (in other words, the smallest number of intervals) to be moved between threads/stripes. To deal with both issues, we build histogram statistics \(H^R\) and \(H^S\) for the input collections online, without extra scanning costs. In particular, we create a much finer partitioning of the domain by splitting it into a predefined number \(\xi \) of granules, with \(\xi \) being a large multiple of k, i.e., \(\xi = \alpha \cdot k\), where \(\alpha \gg 1\). For each granule g, we count the number of intervals \(H^R[g]\) and \(H^S[g]\) from R and S, respectively, that start inside g. We define every initial stripe \(t_j\) as a set of \(\alpha \) consecutive granules; in practice, this partitions the input collections into stripes of equal width, as in our original framework. Further, we select a granule as the smallest unit (number of intervals) to be moved between stripes. The load on each thread depends on the cost of the corresponding partition-join. This cost is optimized if we break it down into mini-joins, as described in Sect. 7.3.1, because numerous comparisons are saved. Empirically, we observed that the cost of the entire bundle of the 5 mini-joins for a stripe \(t_j\) is dominated by the first mini-join, i.e., \(R_j^{A}\bowtie S_j^{A}\), whose cost can be estimated by \(|R_j^{A}|\cdot |S_j^{A}|\). Hence, in order to calculate \(|R_j^{A}|\) (resp. \(|S_j^{A}|\)), we simply accumulate the counts \(H^R[g]\) (resp. \(H^S[g]\)) of all granules \(g\in t_j\). As the heuristic changes the boundaries of a stripe \(t_j\) by moving granules to/from \(t_j\), cardinalities \(|R_j^{A}|\), \(|S_j^{A}|\) and the join cost estimate for \(t_j\) can be incrementally updated very fast.
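The following C++ sketch illustrates the granule-based statistics and the load estimate of a stripe; the incremental updates performed while moving granules between stripes are omitted for brevity, and the helper names are illustrative.

```cpp
#include <cstddef>
#include <vector>

struct Interval { long start, end; };

// Online statistics for adaptive partitioning: the domain is split into
// xi = alpha * k granules and H[g] counts the intervals of one input that
// start inside granule g.
std::vector<long> build_histogram(const std::vector<Interval>& in,
                                  long dom_min, long dom_max, std::size_t xi) {
  std::vector<long> H(xi, 0);
  const long gw = (dom_max - dom_min) / static_cast<long>(xi) + 1;   // granule width
  for (const Interval& x : in)
    ++H[static_cast<std::size_t>((x.start - dom_min) / gw)];
  return H;
}

// Estimated load of a stripe covering granules [first, last]: the product of the
// granule-count sums approximates |R_j^A| * |S_j^A|, i.e., the dominant mini-join.
// When a single granule moves to a neighboring stripe, the two sums (and hence
// the estimate) can be updated incrementally instead of being recomputed.
long stripe_load(const std::vector<long>& HR, const std::vector<long>& HS,
                 std::size_t first, std::size_t last) {
  long r = 0, s = 0;
  for (std::size_t g = first; g <= last; ++g) { r += HR[g]; s += HS[g]; }
  return r * s;
}
```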

8 Strategies for parallel partitioning

We next elaborate on how the partitioning process can benefit from modern parallel hardware. We discuss three strategies applicable to both hash-based and domain-based partitioning; in the next section, we carefully evaluate these strategies for each partitioning type. As a common feature, all strategies operate in three phases. During the first phase, all available CPU cores or threads are employed to calculate the cardinality \(|R_j|\) and \(|S_j|\) of every partition. During the second phase, the threads are employed to allocate the space required to store every partition in main memory and then to physically partition the input collections. Finally, all available threads are again used to sort and index (if needed) the input partitions, depending on the interval join algorithm to be used.Footnote 10 In the following, we detail the first two phases for each partitioning strategy.

\(\mathsf {One2One}\) The first strategy was used in [29] for hash-based partitioning but can be straightforwardly applied to the domain-based one as well. The idea is to exclusively assign every \(R_j\) (resp. \(S_j\)) partition to a single thread.Footnote 11 Under this, the thread executes all phases of the partitioning process for \(R_j\). As every partition of the collection is assigned to exactly one thread, the entire partitioning process is essentially divided into smaller independent tasks which run in parallel without the need for synchronization. Strategy 1 illustrates a high-level pseudo-code of \(\mathsf {One2One}\). After initiating c parallel threads in Line 1, every thread executes the first and the second phase of the partitioning independently in Lines 3–8. Consider thread j. During the first phase in Lines 3–5, thread j is assigned \(\frac{k}{c}\) partitions of the input collection R, where k is the number of requested partitions and c is the number of available threads. Specifically, the thread gets all partitions in the range from \(\left( (j-1)\cdot \frac{k}{c}+1\right) \) to \(\left( j\cdot \frac{k}{c}\right) \). Then, it scans collection R to count how many intervals will be contained inside its assigned partitions. Last, during the second phase in Lines 6–8, every thread allocates the space needed to store its assigned partitions and then scans the input collection a second time to fill these partitions.

Strategy 1 (pseudo-code of \(\mathsf {One2One}\))

Despite its simplicity, the \(\mathsf {One2One}\) strategy has two important drawbacks. First, it requires multiple scans over the input; to be precise, the collection is scanned \(2\cdot c\) times. Second, the strategy cannot cope with skewed data distributions; essentially, the cost of the entire partitioning process is dominated by the cost of processing the largest partition. In what follows, we discuss two partitioning strategies that address these issues.

Strategy 2 (pseudo-code of \(\mathsf {Temps}\))

\(\mathsf {Temps}\) The key idea for fast partitioning is to assign parts of the input collection to the available threads instead of entire partitions. Under this, every thread reads a chunk of the input containing \(\frac{|R|}{c}\) intervals and builds a temporary local partitioning. The input chunks are disjoint, so the parallel threads operate completely independently. Every thread performs a first scan of its assigned intervals to count how large its local partitions will be, then allocates the required space in main memory and reads the intervals again to fill the partitions. Finally, after all threads have finished, the local partitionings are unified into the final result.

Strategy 2 illustrates a high-level pseudo-code of \(\mathsf {Temps}\). In Lines 2–7, every thread scans (twice) its assigned chunk of the input collection to create a local partitioning. Specifically, thread j gets the j-th chunk of \(\frac{|R|}{c}\) input intervals and produces the local partitioning \(\{R_1^j,\ldots ,R_k^j\}\); notice that local partitionings contain the same number of partitions as the final result. To count the cardinality of its local partitions, the thread maintains private local counters \(\{|R_1^j|,\ldots ,|R_k^j|\}\). After all local partitionings are built (synchronization barrier in Line 8), \(\mathsf {Temps}\) unifies them by copying local partitions to a contiguous space allocated in main memory for the final partitions, in Lines 9–12. Both hash-based and domain-based partitioning assign every interval to exactly one local partition; the same holds for the replicas in the case of domain-based partitioning. Under this, the cardinality of each final partition \(R_i\) is calculated as \(|R_i| = \sum _{j=1}^{c}|R_i^j|\) and the partition is defined as \(R_i^1\cup \ldots \cup R_i^c\), where c is the total number of parallel threads and local partitionings. Last, to accelerate this unification step, the \(\mathsf {Temps}\) strategy assigns the computation of every partition \(R_i\) to the next available thread in a round-robin fashion.

Strategy 3 (pseudo-code of \(\mathsf {Divs}\))

Compared to \(\mathsf {One2One}\), the \(\mathsf {Temps}\) strategy scans the entire input collection R only twice, as every thread now operates on a different chunk of R. In addition, as R’s chunks are equi-sized, i.e., all contain at most \(\frac{|R|}{c}\) intervals, the partitioning load is better distributed to the available threads. However, \(\mathsf {Temps}\) still exhibits important shortcomings. First, for every partition \(R_i\), the strategy allocates twice the required space in main memory, i.e., to store both its corresponding local partitions and \(R_i\) itself. Second, the strategy introduces an extra costly step, i.e., the unification of the local partitionings. Also, the cost of this last step is dominated by the largest partition, which is again computed by a single thread.

\(\mathsf {Divs}\) To address these shortcomings, we next discuss our last strategy. Strategy \(\mathsf {Divs}\) shares the same key idea as \(\mathsf {Temps}\), i.e., every thread j independently processes the j-th chunk of \(\frac{|R|}{c}\) input intervals. But, instead of building a temporary local partitioning, the thread directly updates the final partitions. For this purpose, the strategy logically divides every final partition \(R_i\) into c parts, i.e., one for each available thread. The extent of each \(R_i^j\) part is determined by the local counters \(|R_i^j|\), which are computed similarly to strategy \(\mathsf {Temps}\). With this division, each thread independently fills a dedicated part of \(R_i\)’s data structure in memory without the need for locking or any other type of synchronization.

Strategy 3 illustrates a high-level pseudo-code of \(\mathsf {Divs}\). Lines 2 and 3 are identical to Strategy 2, i.e., a first scan of the input collection determines the local counters \(\{|R_1^j|,\ldots ,|R_k^j|\}\) for each thread j. After the local counters are computed (synchronization barrier in Line 5), \(\mathsf {Divs}\) allocates the necessary space in main memory to build every \(R_i\) partition (Lines 7–8) and also logically divides \(R_i\) into c parts using its local counters (Line 9). Finally, after this preparation step is finished for all partitions (synchronization barrier in Line 10), every thread scans its assigned input intervals a second time and fills its dedicated part inside the data structure of every partition, in Lines 10–13.
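A self-contained C++/OpenMP sketch of \(\mathsf {Divs}\) follows; the single-partition mapping part_of is a hypothetical stand-in for either the hash-based or the domain-based assignment, and the replication of the domain-based paradigm is omitted.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Interval { long start, end; };

// Hypothetical single-partition mapping (the hash-based h(i) = i mod k or the
// stripe of the start endpoint); a trivial stand-in keeps the sketch self-contained.
static int part_of(const Interval& x, int k) { return static_cast<int>(x.start % k); }

// Divs: every thread processes one chunk of the input; after a counting scan,
// each final partition is allocated once and logically divided into c parts
// (via prefix sums of the local counters), so every thread fills its own part
// during the second scan without any locking.
std::vector<std::vector<Interval>> divs_partition(const std::vector<Interval>& R,
                                                  int k, int c) {
  std::vector<std::vector<long>> cnt(c, std::vector<long>(k, 0));       // |R_i^j|
  const std::size_t chunk = (R.size() + c - 1) / c;

  #pragma omp parallel for num_threads(c)                 // first scan: count per chunk
  for (int j = 0; j < c; ++j)
    for (std::size_t x = j * chunk; x < std::min(R.size(), (j + 1) * chunk); ++x)
      ++cnt[j][part_of(R[x], k)];

  std::vector<std::vector<Interval>> parts(k);
  std::vector<std::vector<std::size_t>> off(c, std::vector<std::size_t>(k, 0));
  for (int i = 0; i < k; ++i) {                           // allocate and divide logically
    std::size_t total = 0;
    for (int j = 0; j < c; ++j) { off[j][i] = total; total += cnt[j][i]; }
    parts[i].resize(total);
  }

  #pragma omp parallel for num_threads(c)                 // second scan: fill own part
  for (int j = 0; j < c; ++j)
    for (std::size_t x = j * chunk; x < std::min(R.size(), (j + 1) * chunk); ++x) {
      const int i = part_of(R[x], k);
      parts[i][off[j][i]++] = R[x];
    }
  return parts;
}
```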

Compared to \(\mathsf {Temps}\), the \(\mathsf {Divs}\) strategy does not allocate extra space for every partition; at the same time, the costly unification step of \(\mathsf {Temps}\) is entirely avoided. In addition, the largest partition, which could become the bottleneck for both strategies \(\mathsf {One2One}\) and \(\mathsf {Temps}\), is now filled by multiple threads in parallel, achieving better load balancing.

9 Experiments on parallel processing

Last, we present the second part of our experimental evaluation, which focuses on the parallel computation of interval joins. In view of the results for single-threaded processing in Sect. 6, we next focus on \(\mathsf {optFS}\).

9.1 Setup

The experiments were conducted on the same machine used for the single-threaded tests in Sect. 6 with an identical setup, i.e., XOR workload, all data stored in main memory. Further, we chose to activate hyper-threading, which allowed us to run up to 40 threads, and used OpenMP for multi-threaded processing. Besides varying the |R|/|S| ratio in \(\{0.25,0.5, 0.75,1\}\), we also vary the number of available parallel threads in \(\{5,10,15,20,25,30,35,\) \(40\}\). We indicate the activation of hyper-threading by an h subscript, e.g., \(25_h\). Last, for the adaptive partitioning, we conducted a series of tests to determine the multiplicative factor \(\alpha \), which controls the number of granules in the fine partitioning of the domain (see Sect. 7.3.2). To avoid significantly increasing the partitioning cost, we ended up setting \(\alpha = 1000\) when the number of threads is less than 10, and \(\alpha = 100\) otherwise.

Fig. 10
figure 10

Tuning hash-based partitioning: strategies, \(|R|=|S|\) and 20 threads

Fig. 11
figure 11

Tuning hash-based partitioning: # partitions, \(|R|=|S|\) and 20 threads

9.2 Tuning hash-based partitioning

We first tune the hash-based paradigm. [29] sorts every collection prior to partitioning. We experimented with a variant of the paradigm which does not include such a pre-sorting step and which always proved faster. Hence, in the following we run our variant of the hash-based paradigm. Our analysis investigates which is the best strategy for the parallel partitioning of the inputs and how to select the number of partitions to be created.

9.2.1 Partitioning strategies

Fig. 10 reports the partitioning time of the \(\mathsf {One2One}\), \(\mathsf {Temps}\) and \(\mathsf {Divs}\) strategies while varying the number of partitions on our six datasets. For all tests, we set \(|R|=|S|\) and used up to 20 parallel threads to partition the input collections. The results clearly show that \(\mathsf {Divs}\) is both the most efficient and the most robust partitioning strategy, i.e., its time is little affected by the increase in the number of partitions. \(\mathsf {One2One}\) is competitive with \(\mathsf {Divs}\) only if each collection is split into 20 or more partitions. Recall that \(\mathsf {One2One}\) assigns each partition to exactly one thread, so with fewer than 20 partitions, some of the 20 available threads are never used. A key factor for understanding the differences in the performance of the strategies is the size of the inputs (see Table 1). GREEND and TAXIS contain more than 100m intervals; for these datasets, \(\mathsf {One2One}\) is always slower than both \(\mathsf {Temps}\) and \(\mathsf {Divs}\) due to scanning these big inputs multiple times, while \(\mathsf {Temps}\) is always slower than \(\mathsf {Divs}\) due to creating and unifying the local partitions. The rest of the datasets contain 2m or fewer intervals. Provided that at least 20 partitions are created, \(\mathsf {One2One}\) is always faster than \(\mathsf {Temps}\) because these partitions contain very few intervals and the overhead from local partitioning in \(\mathsf {Temps}\) grows with the number of partitions.

9.2.2 Number of partitions

Piatov et al. [29] suggested that the hash-based paradigm performs at its best when each input is split into \(\sqrt{n}\) partitions, where n is the number of available threads. Under this, every available thread is assigned exactly one of the n in total partition-joins. Although we used this heuristic in our preliminary work [5], we investigate here in detail the impact of the number of partitions.

Figure 11 reports the breakdown of the \(\mathsf {optFS}\) execution time while varying the number of partitions in each collection from 1 to 1,000; note that the number of available parallel threads is fixed to 20. As expected, there is a tradeoff between the number of partitions and the total execution time. Initially, \(\mathsf {optFS}\) benefits from splitting each input into more partitions, but the algorithm slows down when the number of partitions exceeds a particular value. However, our tests also unveil a correlation between the number of partitions and the selectivity of the join. For the highly selective queries in GREEND and INFECTIOUS, the execution time of \(\mathsf {optFS}\) is minimized when the number of partitions approximately equals the number of available threads. On the other hand, for queries of low or medium selectivity, the heuristic from [29] is effective, i.e., the number of partitions should be set to \(\lfloor \sqrt{20}\rfloor = 4\). To understand this behavior, observe the time breakdown in Figures 11(c) and (d) when the number of partitions is set below 20, especially equal to 4. Different from all other cases, the total execution time is dominated by the sorting cost; the actual joining phase is very cheap due to the low number of results. Essentially, we can accelerate sorting by splitting the inputs into more partitions, which creates smaller sorting tasks that run in parallel.

Fig. 12
figure 12

Tuning domain-based partitioning: strategies, \(|R|=|S|\) and 20 threads

9.3 Tuning domain-based partitioning

We next tune our domain-based paradigm. Besides determining the best strategy for parallel partitioning and the number of partitions, we also study the impact of our load balancing techniques from Sect. 7.3.

9.3.1 Partitioning strategies

Fig. 12 reports the domain-based partitioning time for strategies \(\mathsf {One2One}\), \(\mathsf {Temps}\) and \(\mathsf {Divs}\) while varying the number of partitions; for these tests, we again set \(|R|=|S|\) and used up to 20 parallel threads. Also, the adaptive partitioning from Sect. 7.3.2 was deactivated. Similar to Sect. 9.2.1, we observe that \(\mathsf {Divs}\) is the most efficient and most robust strategy for parallel partitioning; on the largest datasets GREEND and TAXIS, \(\mathsf {Temps}\) is competitive with \(\mathsf {Divs}\) but still slower. However, different to our hash-based analysis, \(\mathsf {One2One}\) is clearly the slowest strategy in all cases; its time is severely affected by the increase in the number of partitions, also exhibiting a “staircase” pattern (more obvious in Figures 12(c) and (e)). The difference in \(\mathsf {One2One}\)’s behavior is due to the higher processing cost per interval incurred by domain-based partitioning compared to hash-based. This cost is amplified by the increase in the number of partitions. Recall that for hash-based partitioning, we only need to hash the \(\mathsf {start}\) endpoint of every interval. In contrast, for domain-based partitioning we also need to replicate an interval to all overlapping stripes; the replication cost naturally increases with the number of partitions. Regarding the “staircase” pattern, notice that \(\mathsf {One2One}\)’s time essentially goes up every 20 partitions. Consider for example the increase from 20 to 40 partitions. At first, every thread builds exactly one partition. When we increase the number of partitions to 21, this extra partition will be assigned as a second task to one of the available threads. The total time of this thread will increase and dominate the overall partitioning time. Adding more partitions will not change this overall time, because some threads are still assigned only one partition; the time increases again only when the total number of partitions exceeds 40.

9.3.2 Number of partitions

In [5], we always set the number of partitions equal to the number of threads such that each thread is assigned exactly one partition-join. To confirm the effectiveness of this heuristic, we measure the runtime of \(\mathsf {optFS}\) under the domain-based paradigm while varying the number of partitions from 1 to 1,000. Similar to Sect. 9.2.2, the number of available threads is set to 20.

Figure 13 reports the results of our tests. The expected tradeoff between the execution time and the number of partitions per collection is again observed. But, different from the hash-based paradigm, \(\mathsf {optFS}\) under the domain-based paradigm performs at its best when the number of partitions equals the number of available threads. An exception arises for the very selective joins: in INFECTIOUS, the lowest execution time is observed at around 100 partitions per input, while in GREEND at over 100. Nevertheless, we can safely use the same heuristic even in these cases because (i) the average execution time for INFECTIOUS joins is extremely low (below 20 msec) even for 20 partitions, while (ii) for GREEND, the time does not drop significantly when the number of partitions exceeds 20.

9.3.3 Load balancing

We now evaluate the load balancing achieved by the domain-based partitioning optimizations of Sect. 7.3. To save space, we only show the results on WEBKIT; similar conclusions can be drawn for join queries on the other datasets. Apart from the overall execution time of each join, we also measured the load balancing among the participating CPU threads. Let \(L = \{\ell _1,\ldots ,\ell _n\}\) be the set of measured times spent by each of the n available threads; we define the average idle time as:

$$\begin{aligned} \frac{1}{n}\sum ^{n}_{j=1}{\left( \max (L)-\ell _j\right) } \end{aligned}$$

A high average idle time means that the threads are under-utilized in general, whereas a low average idle time indicates that the load is balanced.
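For instance, with hypothetical measurements for \(n=4\) threads, \(L=\{10,8,6,4\}\), the average idle time is

$$\begin{aligned} \frac{1}{4}\left( (10-10)+(10-8)+(10-6)+(10-4)\right) = 3, \end{aligned}$$

i.e., on average a thread stays idle for 3 time units while waiting for the slowest thread to finish.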

Fig. 13
figure 13

Tuning domain-based partitioning: # partitions, \(|R|=|S|\) and 20 threads

Fig. 14
figure 14

Tuning domain-based partitioning: load balancing, \(\mathsf {optFS}\) on WEBKIT

Table 7 Setups for partitioning-based computation

We experimented by activating or deactivating the mini-joins breakdown, denoted by mj (Sect. 7.3.1), the greedy scheduling, denoted by greedy (Sect. 7.3.1), and the adaptive partitioning, denoted by adaptive (Sect. 7.3.2). We use the term atomic to denote the assignment of each partition-join, or of the bundle of its corresponding 5 mini-joins, to the same thread, and uniform to denote the (non-adaptive) uniform initial partitioning of the domain. We tested the following setups:Footnote 12

  1. (1)

    uniform/atomic is the baseline domain-based paradigm of Sect. 7.3 with all load balancing optimization techniques deactivated;

  2. (2)

    adaptive/atomic is an extension to the baseline that employs only the adaptive partitioning;

  3. (3)

    uniform/mj+atomic splits each partition-join of the baseline into 5 mini-joins which are all executed by the same CPU thread;

  4. (4)

    adaptive/mj+atomic first applies the adaptive partitioning technique and then splits each partition-join into 5 mini-joins to be all executed by the same thread;

  5. (5)

    uniform/mj+greedy splits each partition-join of the baseline into 5 mini-joins which are greedily distributed to the available threads;

  6. (6)

    adaptive/mj+greedy employs all optimizations.

Figures 14(a), (c) report the total execution time for each setup (1)–(6), while Figures 14(b), (d) report the ratio of the average idle time over the execution time.

We observe the following. First, setups (2)–(6) all manage to enhance the parallel computation of the join; their execution time is lower than that of the uniform/atomic baseline. The most efficient setups always include the mj+greedy combination, regardless of whether adaptive partitioning is activated or not. In practice, splitting every partition-join into 5 mini-joins creates mini-jobs of varying costs (recall that 2 of them are cross-products and another 2 are also quite cheap), which facilitates the even distribution of the total join cost across the threads. For example, if one partition is overall heavier than the others, one thread would be dedicated to its most expensive mini-join and the other mini-joins would be handled by less loaded CPU threads. Also, notice that the mj optimization is beneficial even when the 5 defined mini-joins are all executed by the same CPU thread (i.e., uniform/mj+atomic), although the benefit is small compared to the other setups. This is because breaking down a partition-join into 5 mini-joins greatly reduces the overall cost of the partition-join (again, recall that 4 of the mini-joins are cheap).

Adaptive partitioning appears to have a smaller impact compared to the other two optimizations. Among the setups that do not employ greedy scheduling, adaptive/atomic ranks first, both in terms of the execution time and of the average idle time ratio. When activated on top of the uniform/mj+greedy setup, adaptive partitioning enhances the join computation when the number of threads is low (below 20); notice how much faster the adaptive/mj+greedy setup is compared to uniform/mj+greedy in the case of 5 available CPU threads.

Overall, we observe that (i) the mj optimization greatly reduces the cost of a partition-join and adds flexibility to load balancing, and (ii) the uniform/mj+greedy and adaptive/mj+greedy setups perform very well in terms of load balancing, reducing the average idle time of any thread to below 20% of the total execution time in almost all cases (the only exceptions arise for uniform/mj+greedy when \(|R|/|S| = 0.25\) and when fewer than 15 threads are available).

9.4 Comparisons

Table 7 summarizes the best setup for \(\mathsf {optFS}\) under the hash-based and the domain-based paradigms. Both paradigms use \(\mathsf {Divs}\) to efficiently partition the inputs. For hash-based, we set the number of partitions based on the selectivity of the join, i.e., depending on whether \(\mathsf {optFS}\) acts as \(\mathsf {uFS}\) or \(\mathsf {bgudFS}\); for domain-based, we always set the number of partitions equal to the number of available CPU threads. Also, to take full advantage of all proposed load balancing optimizations, we set up the domain-based paradigm as adaptive/mj+greedy.

Fig. 15
figure 15

Comparing parallel processing solutions: \(\mathsf {optFS}\) speedup for \(|R|=|S|\)

We next compare all three approaches for the parallel computation of interval joins.Footnote 13 We first report in Fig. 15 the speedup over single-threaded \(\mathsf {optFS}\) (Sect. 6), while varying the number of available CPU threads; to save space, we omit the results on FLIGHTS and INFECTIOUS since the findings are identical to TAXIS and GREEND, respectively. Overall, we observe that the domain-based paradigm is clearly the most efficient approach, achieving the highest speedup in all cases. In fact, the performance advantage of the domain-based paradigm grows with the number of available threads. This is because the queries benefit increasingly more from the domain-based paradigm's ability to significantly reduce the number of conducted endpoint comparisons. In contrast, the number of comparisons under the hash-based paradigm increases, even compared to single-threaded \(\mathsf {optFS}\), as the number of available threads goes up.Footnote 14 Our tests also reveal the role of join selectivity. For the highly selective queries in GREEND and INFECTIOUS, the hash-based paradigm always outperforms no-partitioning, but for the low-selectivity joins in BOOKS and WEBKIT, no-partitioning is competitive; in fact, for WEBKIT, it always achieves the second highest speedup. For queries of medium selectivity, i.e., in FLIGHTS and TAXIS, no-partitioning is able to achieve a speedup only when up to 5 parallel threads are employed. To understand the behavior of no-partitioning \(\mathsf {optFS}\), we need to discuss two important shortcomings stemming from its master-slaves approach. The first problem is thread starvation; essentially, the master thread cannot create forward scan tasks fast enough for the slaves to run. This is the case with highly selective queries, where the forward scans are too short and hence cheap, as Fig. 6 shows. The second problem is the high number of cache misses incurred by all threads scanning the same data structures in main memory. This problem is amplified when increasing the number of CPU threads used as slaves.

Finally, we report in Fig. 16 the total execution time for each approach while varying the |R|/|S| ratio of the input collections; for these tests, we used up to 20 threads. As expected, all approaches are affected by the increase in input size; their execution times rise. Nevertheless, the domain-based paradigm outperforms both the hash-based paradigm and no-partitioning in every test.

Fig. 16
figure 16

Comparing parallel processing solutions: \(\mathsf {optFS}\) running time for 20 threads

10 Conclusions and future work

In this paper, we targeted the efficient in-memory computation of interval overlap joins. Under single-threaded evaluation, we studied \(\mathsf {FS}\), a simple and efficient algorithm based on plane sweep that does not rely on any special data structures. We proposed four novel optimizations for \(\mathsf {FS}\) that greatly accelerate the algorithm in practice. Our experimental analysis showed that a self-tuning version of \(\mathsf {FS}\), which automatically selects and applies the most appropriate optimizations, is competitive with or even faster than the state-of-the-art. For parallel join evaluation, we proposed (i) a master-slaves approach that does not physically partition the inputs and (ii) a domain-based partitioning framework. Under the latter, each partition-join is broken down into five independent mini-joins which can be greedily assigned to the available CPU threads, achieving a high degree of load balancing. Our experiments showed that our domain-based partitioning framework for parallel joins significantly outperforms both our no-partitioning approach and the hash-based framework of [29], while also scaling well with the number of available threads. In the future, we plan to study interval joins in stream processing. Also, we intend to investigate novel indexing structures for interval queries and joins.