Parallel computation of PDFs on big spatial data using Spark, Distributed and Parallel Databases


Parallel computation of PDFs on big spatial data using Spark
Distributed and Parallel Databases (IF 1.2), Pub Date: 2019-02-21, DOI: 10.1007/s10619-019-07260-3
Ji Liu, Noel Moreno Lemus, Esther Pacitti, Fabio Porto, Patrick Valduriez

We consider big spatial data, which is typically produced in scientific areas such as geological or seismic interpretation. The spatial data can be produced by observation (e.g., using sensors or soil instruments) or by numerical simulation programs, and corresponds to points that represent a 3D soil cube area. However, errors in signal processing and modeling create uncertainty, and thus a lack of accuracy in identifying geological or seismic phenomena. Such uncertainty must be carefully analyzed. To analyze it, the main solution is to compute a Probability Density Function (PDF) for each point in the spatial cube area. However, computing PDFs on big spatial data can be very time-consuming (from several hours to even months on a computer cluster). In this paper, we propose a new solution to efficiently compute such PDFs in parallel using Spark, with three methods: data grouping, machine learning prediction and sampling. We evaluate our solution through extensive experiments on different computer clusters using big data ranging from hundreds of GB to several TB. The experimental results show that our solution scales up very well and can reduce the execution time by a factor of 33 (in the order of seconds or minutes) compared with a baseline method.
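The abstract names data grouping as one of the three methods but gives no details. The sketch below illustrates one plausible reading of the idea and is not the authors' implementation: it assumes that points whose observed values are identical can share a single PDF computation. It uses plain Python rather than Spark, and a normal fit (mean and standard deviation) stands in for whatever PDF fitting the paper actually performs. The names `fit_normal`, `compute_pdfs_grouped`, and the sample `points` data are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def fit_normal(values):
    """Stand-in for PDF fitting: return (mean, std) of a normal fit.

    Assumption: the paper's actual fitting step may differ.
    """
    return (mean(values), pstdev(values))


def compute_pdfs_grouped(points):
    """Fit one PDF per group of points with identical observed values.

    points: dict mapping a point id (e.g. an (x, y, z) tuple) to the
            list of values observed or simulated at that point.
    Returns (pdfs, n_fits): the fitted parameters per point id, and
    the number of fits that were actually performed.
    """
    # Group point ids by their sorted values, so that points with the
    # same multiset of values fall into the same group.
    groups = defaultdict(list)
    for point_id, values in points.items():
        groups[tuple(sorted(values))].append(point_id)

    pdfs = {}
    for key, ids in groups.items():
        params = fit_normal(key)  # computed once per group
        for point_id in ids:
            pdfs[point_id] = params  # shared by every point in the group
    return pdfs, len(groups)


# Hypothetical input: three points of a small cube, two of which
# carry the same values in a different order.
points = {
    (0, 0, 0): [1.0, 2.0, 3.0],
    (0, 0, 1): [3.0, 2.0, 1.0],
    (0, 1, 0): [4.0, 4.0, 4.0],
}
pdfs, n_fits = compute_pdfs_grouped(points)
```

Here three points require only two fits. In a Spark setting, the grouping step would presumably be expressed as a key-based aggregation over partitions, but that mapping is an assumption and is not described in the abstract.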


Updated: 2019-02-21
