Parallel computation of PDFs on big spatial data using Spark

Distributed and Parallel Databases

Abstract

We consider big spatial data, which is typically produced in scientific areas such as geological or seismic interpretation. The spatial data can be produced by observation (e.g. using sensors or soil instruments) or by numerical simulation programs, and corresponds to points that represent a 3D soil cube area. However, errors in signal processing and modeling create uncertainty, and thus a lack of accuracy in identifying geological or seismic phenomena. Such uncertainty must be carefully analyzed. The main solution is to compute a Probability Density Function (PDF) for each point in the spatial cube area. However, computing PDFs on big spatial data can be very time-consuming (from several hours to even months on a computer cluster). In this paper, we propose a new solution to efficiently compute such PDFs in parallel using Spark, with three methods: data grouping, machine learning prediction, and sampling. We evaluate our solution through extensive experiments on different computer clusters using big data ranging from hundreds of GB to several TB. The experimental results show that our solution scales up very well and can reduce the execution time by a factor of 33 (on the order of seconds or minutes) compared with a baseline method.

Notes

  1. The nth moment is defined as:

    $$\begin{aligned} \mu _n = \sum _{i = 1}^m{(x_i - \mu )^n} \end{aligned}$$
    (7)

    where m is the number of values in the dataset, \(x_i\) is the ith value in the dataset and \(\mu \) is the mean value of the dataset.

  2. Let \(n_{window}\) be the number of points that have different sets of mean and standard deviation values in a window and \(N_{window}\) be the total number of points in a window. Then \(\alpha = \frac{n_{window}}{N_{window}}\). We assume that there are l windows in a slice. When the points in different windows have different sets of mean and standard deviation values, \(\beta = \frac{l \cdot n_{window}}{l \cdot N_{window}} = \alpha \). When k points have the same sets of mean and standard deviation values in two different slices, \(\beta = \frac{l \cdot n_{window} - k}{l \cdot N_{window}} < \alpha \).
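
The two notes above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (nth_moment, window_alpha, slice_beta) and the toy data are hypothetical, and the sketch assumes that \(n_{window}\) counts the distinct (mean, standard deviation) pairs in a window and that those pairs are already rounded so that exact equality is meaningful. The moment follows Eq. (7) exactly as written, that is, without a \(1/m\) normalization.

```python
from typing import Sequence, Tuple

# Hypothetical type: the (mean, standard deviation) pair of one point.
Stats = Tuple[float, float]


def nth_moment(values: Sequence[float], n: int) -> float:
    """Eq. (7) as written in Note 1: sum over i of (x_i - mu)^n."""
    m = len(values)
    if m == 0:
        raise ValueError("values must be non-empty")
    mu = sum(values) / m  # mean of the dataset
    return sum((x - mu) ** n for x in values)


def window_alpha(window: Sequence[Stats]) -> float:
    """alpha = n_window / N_window for one window (Note 2).

    Assumption: n_window is the number of distinct (mean, std) pairs.
    """
    if not window:
        raise ValueError("window must be non-empty")
    return len(set(window)) / len(window)


def slice_beta(windows: Sequence[Sequence[Stats]]) -> float:
    """beta for a slice: distinct (mean, std) pairs over all points.

    Pairs shared between windows are counted once, which yields the
    (l * n_window - k) numerator of Note 2 when windows have equal sizes.
    """
    total = sum(len(w) for w in windows)
    if total == 0:
        raise ValueError("slice must contain at least one point")
    distinct = len({pair for w in windows for pair in w})
    return distinct / total


# Toy slice with l = 2 windows of N_window = 4 points each.
# Each window has n_window = 3 distinct pairs; the pair (0.5, 1.2)
# appears in both windows, so k = 1.
w1 = [(0.0, 1.0), (0.0, 1.0), (0.5, 1.2), (1.0, 0.8)]
w2 = [(2.0, 1.0), (2.0, 1.0), (0.5, 1.2), (3.0, 0.9)]

alpha = window_alpha(w1)      # 3 / 4
beta = slice_beta([w1, w2])   # (2 * 3 - 1) / (2 * 4)
second_moment = nth_moment([1.0, 2.0, 3.0, 4.0], 2)
```

In this toy slice, beta is smaller than alpha because one pair is shared between the two windows, which matches the inequality stated in Note 2.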

Acknowledgements

This work was partially funded by EU H2020 Project HPC4e with MCTI/RNP-Brazil, CNPq, FAPERJ, and Inria Associated Team SciDISC. The experiments were carried out using a cluster at LNCC in Brazil and the Grid5000 testbed in France (https://www.grid5000.fr).

Author information

Corresponding author

Correspondence to Ji Liu.

About this article

Cite this article

Liu, J., Lemus, N.M., Pacitti, E. et al. Parallel computation of PDFs on big spatial data using Spark. Distrib Parallel Databases 38, 63–100 (2020). https://doi.org/10.1007/s10619-019-07260-3

  • DOI: https://doi.org/10.1007/s10619-019-07260-3
