Abstract
Numerous scientific disciplines have witnessed tremendous growth in the amount of spatial data produced over the past decade. To handle the volume and velocity of such data, researchers have embraced distributed systems, which partition data among multiple nodes to provide scalability and high availability. Previous work on partitioning large spatiotemporal data focuses on bulk ingestion and static partitioning, and therefore cannot handle the dynamic data and query workloads that are common for real-time data. In this paper, we develop GeoBalance, a workload-aware partitioning approach for spatiotemporal data that can adapt partitions on the fly without disrupting data ingestion or retrieval. GeoBalance employs a spatial evolutionary algorithm to incrementally tune the partitions according to a geo-aware partitioning fitness function. In addition, we perform a rolling migration from one partitioning scheme to another to ensure that data ingestion and retrieval are not compromised during the partition change period. We conduct multiple experiments using a write-intensive hybrid workload of Twitter data and random hotspots to demonstrate that the GeoBalance partitioning approach outperforms statically defined partitions and other partitioning algorithms such as k-d tree partitioning.
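To make the idea of incrementally tuning partitions against a fitness function concrete, the following is a minimal Python sketch and not the GeoBalance implementation. The abstract does not define the fitness function, so everything here is an assumption made for illustration: the max-to-mean load ratio used as fitness, the integer grid cells, the single-cell mutation, and the names `imbalance`, `mutate`, and `evolve`. Geo-aware terms such as spatial contiguity or migration cost are deliberately left out.

```python
import random

def imbalance(assignment, load, num_nodes):
    """Ratio of the busiest node's load to the mean node load (1.0 is perfectly balanced)."""
    totals = [0.0] * num_nodes
    for cell, node in assignment.items():
        totals[node] += load[cell]
    mean = sum(totals) / num_nodes
    return max(totals) / mean if mean > 0 else 1.0

def mutate(assignment, num_nodes, rng):
    """Return a copy of the assignment with one randomly chosen cell moved to a different node."""
    child = dict(assignment)
    cell = rng.choice(sorted(child))
    child[cell] = rng.choice([n for n in range(num_nodes) if n != child[cell]])
    return child

def evolve(assignment, load, num_nodes, generations=500, seed=0):
    """(1+1) evolutionary loop: keep a mutated child only if it is no worse than its parent."""
    rng = random.Random(seed)
    best, best_fit = dict(assignment), imbalance(assignment, load, num_nodes)
    for _ in range(generations):
        child = mutate(best, num_nodes, rng)
        fit = imbalance(child, load, num_nodes)
        if fit <= best_fit:
            best, best_fit = child, fit
    return best, best_fit

# Hypothetical workload: 16 grid cells with a hotspot in cells 0..3.
load = {c: (50.0 if c < 4 else 5.0) for c in range(16)}
start = {c: c // 4 for c in range(16)}  # static partitioning: 4 consecutive cells per node
tuned, tuned_fit = evolve(start, load, num_nodes=4)
```

In this toy setup the static assignment places the whole hotspot on one node, and the loop only ever accepts reassignments that do not increase the imbalance. A real system would also need to bound how much data each accepted change moves, which is the concern the rolling migration in the abstract addresses.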
Notes
Resourcing Open Geo-spatial Education and Research
Acknowledgements
This material is based upon work supported in part by the National Science Foundation (NSF) under grant numbers: 1047916, 1354329 and 1443080. The work used the ROGER supercomputer, which is supported by the NSF under grant number: 1429699. The authors would also like to thank the members of the CyberInfrastructure and Geospatial Information Laboratory (CIGI, http://cigi.illinois.edu/) for their insightful comments and discussions.
Additional information
The research was conducted when the first author was a PhD student and researcher at the CyberGIS Center.
Cite this article
Soltani, K., Padmanabhan, A. & Wang, S. GeoBalance: workload-aware partitioning of real-time spatiotemporal data. Geoinformatica 26, 67–94 (2022). https://doi.org/10.1007/s10707-021-00444-z
DOI: https://doi.org/10.1007/s10707-021-00444-z