GeoBalance: workload-aware partitioning of real-time spatiotemporal data

GeoInformatica

Abstract

Numerous scientific disciplines have witnessed tremendous growth in the amount of spatial data produced over the past decade. To handle the volume and velocity of such data, researchers have embraced distributed systems, which partition data among multiple nodes to provide scalability and high availability. Previous work on partitioning large spatiotemporal data focuses on bulk ingestion and static partitioning, and is therefore unable to handle the dynamic data and query workloads that are common for real-time data. In this paper we develop GeoBalance, a workload-aware partitioning approach for spatiotemporal data that can adapt partitions on the fly without disrupting the data ingestion and retrieval process. GeoBalance employs a spatial evolutionary algorithm to incrementally tune the partitions according to a geo-aware partitioning fitness function. In addition, we perform a rolling migration from one partitioning scheme to another to ensure that data ingestion and retrieval are not compromised during the partition change period. We conduct multiple experiments using a write-intensive hybrid workload of Twitter data and random hotspots to demonstrate that the GeoBalance partitioning approach outperforms statically defined partitions and other partitioning algorithms such as the k-d tree.
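The idea of incrementally tuning partitions against a fitness function can be illustrated with a toy sketch. This is not the authors' implementation, and every detail is an assumption made for illustration: cells are laid out as a one-dimensional strip (for instance, cells ordered along a space-filling curve), each cell carries a hypothetical workload value, the fitness is the maximum partition load divided by the mean partition load, and the only mutation is shifting one partition boundary by a single cell so that partitions stay contiguous. The function names `imbalance`, `mutate`, and `evolve` are hypothetical.

```python
import random


def imbalance(loads, cuts):
    """Max partition load divided by mean partition load (1.0 = perfectly balanced).

    Assumes the total load is positive and every partition is non-empty.
    """
    bounds = [0, *cuts, len(loads)]
    totals = [sum(loads[a:b]) for a, b in zip(bounds, bounds[1:])]
    return max(totals) / (sum(totals) / len(totals))


def mutate(cuts, n_cells, rng):
    """Shift one randomly chosen cut by one cell; reject moves that empty a partition."""
    new = list(cuts)
    i = rng.randrange(len(new))
    new[i] += rng.choice((-1, 1))
    lo = new[i - 1] + 1 if i > 0 else 1
    hi = new[i + 1] - 1 if i + 1 < len(new) else n_cells - 1
    return new if lo <= new[i] <= hi else list(cuts)


def evolve(loads, cuts, generations=2000, seed=0):
    """Mutation-only evolutionary loop: keep a candidate if it is no worse than the best."""
    rng = random.Random(seed)
    best, best_fit = list(cuts), imbalance(loads, cuts)
    for _ in range(generations):
        cand = mutate(best, len(loads), rng)
        fit = imbalance(loads, cand)
        if fit <= best_fit:
            best, best_fit = cand, fit
    return best, best_fit


# Hypothetical workload: a hotspot in cells 4 and 5, starting from equal-size partitions.
loads = [1, 1, 1, 1, 50, 40, 1, 1, 1, 1, 1, 1]
cuts, fit = evolve(loads, [4, 8])
print(cuts, round(fit, 2))
```

Because each mutation moves only one boundary cell, successive partitionings differ by a small amount of data, which is the property that makes an incremental, rolling change of partitions plausible. A geo-aware fitness function would presumably weigh more than load balance; this sketch deliberately models load only.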

Notes

  1. https://vertx.io

  2. https://rocksdb.org

  3. Resourcing Open Geo-spatial Education and Research

Acknowledgements

This material is based upon work supported in part by the National Science Foundation (NSF) under grant numbers: 1047916, 1354329 and 1443080. The work used the ROGER supercomputer, which is supported by the NSF under grant number: 1429699. The authors would also like to thank the members of the CyberInfrastructure and Geospatial Information Laboratory (CIGI, http://cigi.illinois.edu/) for their insightful comments and discussions.

Author information

Corresponding author

Correspondence to Shaowen Wang.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The research was conducted when the first author was a PhD student and researcher at the CyberGIS Center.

About this article

Cite this article

Soltani, K., Padmanabhan, A. & Wang, S. GeoBalance: workload-aware partitioning of real-time spatiotemporal data. Geoinformatica 26, 67–94 (2022). https://doi.org/10.1007/s10707-021-00444-z

  • DOI: https://doi.org/10.1007/s10707-021-00444-z
