

Parallelisation of a Cache-Based Stream-Relation Join for a Near-Real-Time Data Warehouse
Electronics ( IF 2.6 ) Pub Date : 2020-08-12 , DOI: 10.3390/electronics9081299
M. Asif Naeem , Habib Khan , Saad Aslam , Noreen Jamil

Near-real-time data warehousing is an important area of research, as business organisations want to analyse their sales with minimal latency. Sales data generated by data sources therefore needs to be reflected in the data warehouse immediately. This requires near-real-time transformation of the stream of sales data, in the staging area, using a disk-based relation called master data. For this purpose, a stream-relation join is required. The main problem in stream-relation joins is the different nature of the two inputs: stream data is fast and bursty, whereas access to the disk-based relation is slow due to high disk I/O cost. To resolve this problem, a well-known algorithm called CACHEJOIN (cache join) was published in the literature. The algorithm has two phases, the disk-probing phase and the stream-probing phase. These two phases execute sequentially, which means that stream tuples wait unnecessarily. This prevents the algorithm from exploiting CPU resources optimally. In this paper, we address this issue by presenting a robust algorithm called PCSRJ (parallelised cache-based stream-relation join). The new algorithm enables the disk-probing and stream-probing phases of CACHEJOIN to execute in parallel. It distributes the disk-based relation across two separate nodes and runs CACHEJOIN in parallel on each node. It also implements a strategy for splitting the stream data between the nodes according to the part of the relation each node holds. We developed a cost model for PCSRJ and validated it empirically. We compared the service rates of both algorithms using a synthetic dataset. Our experiments showed that PCSRJ significantly outperforms CACHEJOIN.
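The abstract gives no implementation details, so the following is only a minimal sketch of the general idea, not the authors' PCSRJ. It assumes a range split of the master data at a hypothetical `pivot` key, uses an in-memory dict to stand in for the disk-based relation, and uses two Python threads to stand in for the two nodes. All function names and parameters are illustrative. Within each node the cache probe and the disk probe still run one after the other here, so the sketch shows only the node-level parallelism and the stream-splitting step.

```python
from concurrent.futures import ThreadPoolExecutor


def split_relation(master, pivot):
    """Partition the master data into two parts by key (assumed range split)."""
    low = {k: v for k, v in master.items() if k < pivot}
    high = {k: v for k, v in master.items() if k >= pivot}
    return low, high


def route_stream(stream, pivot):
    """Send each stream tuple to the node that holds the matching key range."""
    low, high = [], []
    for key, payload in stream:
        (low if key < pivot else high).append((key, payload))
    return low, high


def node_join(partition, cache_keys, tuples):
    """Simplified per-node join: probe the cache first, then the 'disk'."""
    # The cache holds a subset of this node's master tuples (assumed given).
    cache = {k: partition[k] for k in cache_keys if k in partition}
    joined, pending = [], []
    for key, payload in tuples:  # stream-probing phase: cache lookup
        if key in cache:
            joined.append((key, payload, cache[key]))
        else:
            pending.append((key, payload))
    for key, payload in pending:  # disk-probing phase: dict stands in for disk
        if key in partition:
            joined.append((key, payload, partition[key]))
    return joined


def pcsrj_sketch(master, stream, pivot, cache_keys):
    """Run the simplified join on two 'nodes' (threads) concurrently."""
    parts = split_relation(master, pivot)
    streams = route_stream(stream, pivot)
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [
            pool.submit(node_join, part, cache_keys, tuples)
            for part, tuples in zip(parts, streams)
        ]
        return [row for future in futures for row in future.result()]


if __name__ == "__main__":
    master = {1: "A", 2: "B", 3: "C", 4: "D"}
    stream = [(1, "s1"), (3, "s2"), (2, "s3"), (5, "s4"), (4, "s5")]
    for row in pcsrj_sketch(master, stream, pivot=3, cache_keys={1, 4}):
        print(row)
```

Stream tuples whose key has no match in the master data (key 5 in the usage example) are simply dropped in this sketch. Threads are used only to keep the example self-contained; the abstract describes separate nodes, and a faithful implementation would also need the paper's cache-maintenance and cost-model details, which are not given here.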


Updated: 2020-08-12
