A comparative experimental study of distributed storage engines for big spatial data processing using GeoSpark. The Journal of Supercomputing



The Journal of Supercomputing (IF 2.5), Pub Date: 2021-07-01, DOI: 10.1007/s11227-021-03946-7
Hansub Shin₁, Kisung Lee₂, Hyuk-Yoon Kwon₁


With the growing number of GPS-equipped mobile devices, we are witnessing a deluge of spatial information that needs to be managed effectively and efficiently. Although several distributed spatial data processing systems exist, such as GeoSpark (Apache Sedona), the effects of the underlying storage engine on spatial data processing have not been well studied. In this paper, we evaluate the performance of various distributed storage engines for processing large-scale spatial data using GeoSpark, a state-of-the-art distributed spatial data processing system running on top of Apache Spark. For our performance evaluation, we choose three distributed storage engines with different characteristics: (1) HDFS, (2) MongoDB, and (3) Amazon S3. To conduct our experimental study in a real cloud computing environment, we use Amazon EMR instances (up to 6 instances) for distributed spatial data processing. For the evaluation of big spatial data processing, we generate data sets covering four different data distributions and various data sizes of up to one billion point records (38.5 GB raw size). Through extensive experiments, we measure the processing time of the storage engines under the following variations: (1) sharding strategies in MongoDB, (2) caching effects, (3) data distributions, (4) data set sizes, (5) the number of running executors and storage nodes, and (6) the selectivity of queries. The major observations from the experiments are as follows. (1) The overall performance of MongoDB-based GeoSpark is lower than that of HDFS- and S3-based GeoSpark in our experimental settings. (2) The performance of MongoDB-based GeoSpark improves relative to the others on large-scale data sets. (3) HDFS- and S3-based GeoSpark scale better with the number of running executors and storage nodes than MongoDB-based GeoSpark. (4) The sharding strategy based on spatial proximity significantly improves the performance of MongoDB-based GeoSpark. (5) S3- and HDFS-based GeoSpark show similar performance in all the environmental settings. (6) Caching in distributed environments improves the overall performance of spatial data processing. These results can inform the choice of the most suitable storage engine for big spatial data processing in a target distributed environment.
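The abstract does not say how the spatial-proximity sharding strategy is implemented, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes a Z-order (Morton) key computed from longitude and latitude and a simple range partition of that key space; the function names `grid_cell`, `morton_key`, and `shard_for` are hypothetical and come from no library.

```python
def grid_cell(lon, lat, bits=16):
    """Map a lon/lat point to integer grid coordinates (2**bits cells per axis)."""
    cells = 1 << bits
    # Clamp so that lon = 180 or lat = 90 falls into the last cell.
    x = min(int((lon + 180.0) / 360.0 * cells), cells - 1)
    y = min(int((lat + 90.0) / 180.0 * cells), cells - 1)
    return x, y


def morton_key(lon, lat, bits=16):
    """Interleave the bits of the grid coordinates into a Z-order key.

    Points that are close in space tend to receive numerically close keys
    (an assumption of this sketch, not a claim about the paper).
    """
    x, y = grid_cell(lon, lat, bits)
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)       # even bit positions: x
        key |= ((y >> i) & 1) << (2 * i + 1)   # odd bit positions: y
    return key


def shard_for(key, num_shards, bits=16):
    """Range-partition the key space into contiguous, equally sized chunks."""
    key_space = 1 << (2 * bits)
    return key * num_shards // key_space


if __name__ == "__main__":
    # Two nearby points and one distant point (coordinates chosen for illustration).
    for lon, lat in [(126.97, 37.56), (126.98, 37.57), (-74.0, 40.7)]:
        key = morton_key(lon, lat)
        print(lon, lat, "-> shard", shard_for(key, num_shards=4))
```

With a range partition like this, nearby points usually land on the same shard, so a spatial range query touches fewer shards than it would under a hash of an unrelated field. Whether that matches the strategy evaluated in the paper would need to be checked against the full text.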


Updated: 2021-07-01
