Parallel computation of PDFs on big spatial data using Spark

Liu, Ji; Lemus, Noel Moreno; Pacitti, Esther; Porto, Fabio; Valduriez, Patrick

doi:10.1007/s10619-019-07260-3

Published: 21 February 2019

Distributed and Parallel Databases, volume 38, pages 63–100 (2020)

Ji Liu¹ (ORCID: orcid.org/0000-0002-9421-4100), Noel Moreno Lemus², Esther Pacitti¹, Fabio Porto², Patrick Valduriez¹


Abstract

We consider big spatial data, which is typically produced in scientific areas such as geological or seismic interpretation. The spatial data can be produced by observation (e.g., using sensors or soil instruments) or by numerical simulation programs, and corresponds to points that represent a 3D soil cube area. However, errors in signal processing and modeling create uncertainty, and thus a lack of accuracy in identifying geological or seismic phenomena. Such uncertainty must be carefully analyzed. The main solution for analyzing uncertainty is to compute a Probability Density Function (PDF) for each point in the spatial cube area. However, computing PDFs on big spatial data can be very time-consuming (from several hours to even months on a computer cluster). In this paper, we propose a new solution to efficiently compute such PDFs in parallel using Spark, with three methods: data grouping, machine learning prediction, and sampling. We evaluate our solution through extensive experiments on different computer clusters using big data ranging from hundreds of GB to several TB. The experimental results show that our solution scales up very well and can reduce the execution time by a factor of 33 (to the order of seconds or minutes) compared with a baseline method.

Notes

The nth moment is defined as:
$$\mu_n = \sum_{i=1}^{m} (x_i - \mu)^n \tag{7}$$
where $m$ is the number of values in the dataset, $x_i$ is the ith value in the dataset and $\mu$ is the mean value of the dataset.
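As a small illustration of the definition above, the following Python sketch computes the nth moment exactly as the formula is written, that is, as a plain sum of powered deviations from the mean with no division by m. The function name `nth_moment` and the sample values are hypothetical and chosen only for illustration; they do not come from the article.

```python
from typing import Sequence


def nth_moment(values: Sequence[float], n: int) -> float:
    """Sum of the n-th powers of deviations from the mean (no 1/m factor)."""
    if not values:
        raise ValueError("values must be non-empty")
    mu = sum(values) / len(values)  # mean value of the dataset
    return sum((x - mu) ** n for x in values)


# Hypothetical sample, used only to exercise the function.
sample = [1.0, 2.0, 3.0, 4.0]
second_moment = nth_moment(sample, 2)
third_moment = nth_moment(sample, 3)
```

For a sample that is symmetric around its mean, such as the one above, the odd moments cancel out, which is a quick sanity check on the implementation.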
Let $n_{window}$ be the number of points that have different sets of mean and standard deviation values in a window and $N_{window}$ be the total number of points in a window. Then, $\alpha = \frac{n_{window}}{N_{window}}$. We assume that there are $l$ windows in a slice. When the points in different windows have different sets of mean and standard deviation values, $\beta = \frac{l * n_{window}}{l * N_{window}} = \alpha$. When $k$ points have the same sets of mean and standard deviation values in two different slices, $\beta = \frac{l * n_{window} - k}{l * N_{window}} < \alpha$.
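The relation between $\alpha$ and $\beta$ can be checked with a short numeric sketch. The function names and the numbers below (4 windows, 30 distinct points out of 100 per window, 20 shared points) are assumptions made purely for illustration and are not taken from the article's experiments.

```python
def alpha(n_window: int, N_window: int) -> float:
    """Ratio of points with distinct (mean, std) sets within one window."""
    return n_window / N_window


def beta(l: int, n_window: int, N_window: int, k: int = 0) -> float:
    """Same ratio over l windows, with k points whose (mean, std) sets repeat."""
    return (l * n_window - k) / (l * N_window)


# Hypothetical values for illustration only.
l, n_window, N_window, k = 4, 30, 100, 20

a = alpha(n_window, N_window)                 # 30 / 100
b_no_repeat = beta(l, n_window, N_window)     # 120 / 400, equals alpha
b_repeat = beta(l, n_window, N_window, k)     # (120 - 20) / 400, below alpha
```

With no repeated points the two ratios coincide, and any positive k pushes $\beta$ strictly below $\alpha$, matching the inequality stated in the note.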

Acknowledgements

This work was partially funded by EU H2020 Project HPC4e with MCTI/RNP-Brazil, CNPq, FAPERJ, and Inria Associated Team SciDISC. The experiments were carried out using a cluster at LNCC in Brazil and the Grid5000 testbed in France (https://www.grid5000.fr).

Author information

Authors and Affiliations

Inria and LIRMM, University of Montpellier, Montpellier, France
Ji Liu, Esther Pacitti & Patrick Valduriez
LNCC Petrópolis, Petrópolis, Brazil
Noel Moreno Lemus & Fabio Porto

Corresponding author

Correspondence to Ji Liu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Liu, J., Lemus, N.M., Pacitti, E. et al. Parallel computation of PDFs on big spatial data using Spark. Distrib Parallel Databases 38, 63–100 (2020). https://doi.org/10.1007/s10619-019-07260-3

Published: 21 February 2019
Issue Date: March 2020
DOI: https://doi.org/10.1007/s10619-019-07260-3
