Abstract
We consider big spatial data, which is typically produced in scientific areas such as geological or seismic interpretation. The spatial data can be produced by observation (e.g. using sensors or soil instruments) or by numerical simulation programs, and corresponds to points that represent a 3D soil cube area. However, errors in signal processing and modeling create uncertainty, and thus a lack of accuracy in identifying geological or seismic phenomena. Such uncertainty must be carefully analyzed. To analyze uncertainty, the main solution is to compute a Probability Density Function (PDF) for each point in the spatial cube area. However, computing PDFs on big spatial data can be very time consuming (from several hours to even months on a computer cluster). In this paper, we propose a new solution to efficiently compute such PDFs in parallel using Spark, with three methods: data grouping, machine learning prediction and sampling. We evaluate our solution by extensive experiments on different computer clusters using big data ranging from hundreds of GB to several TB. The experimental results show that our solution scales up very well and can reduce the execution time by a factor of 33 (in the order of seconds or minutes) compared with a baseline method.
Notes
The nth moment is defined as:
$$\begin{aligned} \mu_n = \sum_{i = 1}^{m} (x_i - \mu)^n \end{aligned} \tag{7}$$
where m is the number of values in the dataset, \(x_i\) is the ith value in the dataset and \(\mu\) is the mean value of the dataset.
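As a minimal sketch, the moment of Eq. (7) can be computed directly from its definition. The function name `nth_moment` is hypothetical and is not taken from the paper. The code follows the formula exactly as written in this note, so the sum is not divided by m.

```python
def nth_moment(values, n):
    """Return the sum of (x_i - mu)^n over all values, as in Eq. (7).

    Assumption: no normalization by the number of values is applied,
    because the formula in the note does not include one.
    """
    m = len(values)
    if m == 0:
        raise ValueError("the dataset must contain at least one value")
    mu = sum(values) / m  # mean value of the dataset
    return sum((x - mu) ** n for x in values)


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0]
    print(nth_moment(data, 2))  # second moment of the sample data
```

Dividing the result by `len(values)` would give the normalized central moment, if that convention is preferred.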
Let \(n_{window}\) be the number of points that have different sets of mean and standard deviation values in a window and \(N_{window}\) be the total number of points in a window. Then \(\alpha = \frac{n_{window}}{N_{window}}\). We assume that there are l windows in a slice. When the points in different windows have different sets of mean and standard deviation values, \(\beta = \frac{l \cdot n_{window}}{l \cdot N_{window}} = \alpha\). When k points have the same sets of mean and standard deviation values in two different slices, \(\beta = \frac{l \cdot n_{window} - k}{l \cdot N_{window}} < \alpha\).
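The relation between \(\alpha\) and \(\beta\) can be checked with a short numeric sketch. The function names and the sample numbers below are illustrative assumptions, not values from the paper. Exact fractions are used so that the equality and inequality hold without rounding error.

```python
from fractions import Fraction


def alpha(n_window, N_window):
    """Ratio of points with distinct (mean, standard deviation) sets in one window."""
    return Fraction(n_window, N_window)


def beta(l, n_window, N_window, k=0):
    """Same ratio over l windows, with k points sharing their sets."""
    return Fraction(l * n_window - k, l * N_window)


if __name__ == "__main__":
    # Hypothetical sizes: 4 windows, 100 points each, 30 distinct sets per window.
    a = alpha(30, 100)
    print(a, beta(4, 30, 100))        # no shared sets: beta equals alpha
    print(a, beta(4, 30, 100, k=20))  # 20 shared points: beta is below alpha
```

With k = 0 the factor l cancels and \(\beta\) equals \(\alpha\). Any k > 0 reduces the numerator only, which is why \(\beta\) falls below \(\alpha\).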
Acknowledgements
This work was partially funded by EU H2020 Project HPC4e with MCTI/RNP-Brazil, CNPq, FAPERJ, and Inria Associated Team SciDISC. The experiments were carried out using a cluster at LNCC in Brazil and the Grid5000 testbed in France (https://www.grid5000.fr).
Cite this article
Liu, J., Lemus, N.M., Pacitti, E. et al. Parallel computation of PDFs on big spatial data using Spark. Distrib Parallel Databases 38, 63–100 (2020). https://doi.org/10.1007/s10619-019-07260-3
DOI: https://doi.org/10.1007/s10619-019-07260-3