BS-Join: A novel and efficient mixed batch-stream join method for spatiotemporal data management in Flink

Future Generation Computer Systems (IF 7.5), Pub Date: 2022-11-15, DOI: 10.1016/j.future.2022.11.016
Hangxu Ji, Su Jiang, Yuhai Zhao, Gang Wu, Guoren Wang, George Y. Yuan

Mixed batch-stream data processing, a new computing model, plays a crucial role in big spatiotemporal data management. At its core, mixed batch-stream data join places high demands on throughput and latency because two types of data sources coexist. Apache Flink is the most suitable distributed system for mixed batch-stream data join: it has lower latency than join computation models based on Hadoop and Spark, and it simulates remote real-time reading of batch data sources and completes the join computation with the DataStream API. However, as the degree of parallelism increases, frequent remote data reads cause heavy disk and communication pressure, reducing job efficiency and scalability. These effects are further amplified when simulating complex operations such as range joins. To address these difficulties and exploit the characteristics of mixed batch-stream data join, a cache-based framework that natively supports mixed batch-stream join computing is proposed; it speeds up lookups during the join by building indexes on the batch data sources. In addition, for equijoin and range join, optimization mechanisms based on hotspot awareness and on skip lists, respectively, are proposed to further improve job efficiency. In summary, the advantages of our work are as follows: (1) the proposed framework enables Flink to natively support mixed batch-stream data join, improving throughput by 5 times and speedup by 4 times; (2) the optimization mechanism based on hotspot awareness further improves the efficiency of equijoin; (3) compared with range queries using traditional operators in Flink, throughput is increased by 6 times while latency is reduced by 45%.
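
The abstract gives no implementation details, so the following is only a minimal sketch of the general idea of joining stream events against a locally cached, indexed copy of the batch source instead of re-reading it remotely. It is not the BS-Join implementation and does not use Flink. The names `CachedBatchSide`, `equi_join`, and `range_join` are hypothetical, and a sorted list searched with Python's `bisect` module stands in for the skip list mentioned in the abstract; hotspot awareness is not modeled.

```python
import bisect
from collections import defaultdict


class CachedBatchSide:
    """In-memory copy of the batch source with two indexes (hypothetical helper)."""

    def __init__(self, rows, key):
        # rows: iterable of dicts; key: name of the join attribute.
        self.key = key
        # Sorted copy of the rows plus a parallel key list for range lookups.
        # Assumption: a sorted list replaces the paper's skip list here.
        self.sorted_rows = sorted(rows, key=lambda r: r[key])
        self.sorted_keys = [r[key] for r in self.sorted_rows]
        # Hash index for equijoin lookups.
        self.hash_index = defaultdict(list)
        for row in self.sorted_rows:
            self.hash_index[row[key]].append(row)

    def equi_matches(self, value):
        # .get avoids creating empty entries for missing keys.
        return self.hash_index.get(value, [])

    def range_matches(self, low, high):
        # All rows with low <= key <= high, located by binary search.
        lo = bisect.bisect_left(self.sorted_keys, low)
        hi = bisect.bisect_right(self.sorted_keys, high)
        return self.sorted_rows[lo:hi]


def equi_join(stream, batch, stream_key):
    """Yield (event, row) pairs whose join keys are equal."""
    for event in stream:
        for row in batch.equi_matches(event[stream_key]):
            yield event, row


def range_join(stream, batch, stream_key, radius):
    """Yield (event, row) pairs whose keys differ by at most `radius`."""
    for event in stream:
        value = event[stream_key]
        for row in batch.range_matches(value - radius, value + radius):
            yield event, row
```

In this sketch each stream event costs one hash probe for an equijoin, or two binary searches plus a slice for a range join, with no remote read per event. A skip list would offer similar logarithmic lookups while also supporting cheap insertions if the cached batch side changes.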


Updated: 2022-11-15
