An adaptive strategy for statistics collecting in distributed database

Gao, Jintao; Liu, Wenjie; Li, Zhanhuai

doi:10.1007/s11704-019-9107-z

Research Article
Published: 03 January 2020

Volume 14, article number 145610, (2020)

Frontiers of Computer Science

Jintao Gao¹,
Wenjie Liu¹ &
Zhanhuai Li¹

Abstract

Collecting statistics is a time- and resource-consuming operation in database systems. In a distributed database it is even more challenging to collect statistics efficiently while keeping them correct and without degrading system performance. Traditional strategies usually consider only one of these dimensions when collecting statistics and therefore lack adaptiveness. In this paper, we propose an adaptive strategy for statistics collecting (ASC) that balances collecting efficiency, correctness of statistics, and the effect on system performance. We formally define the procedure of collecting statistics, abstract the relationships among these three factors, and introduce an elastic structure (ESI) that stores the information generated while the strategy runs. Using the information stored in ESI, ASC picks an appropriate time to trigger collection, filters out unnecessary tasks, and allocates the remaining collecting tasks to appropriate execution locations with suitable execution models. We implement and evaluate our strategy in a distributed database. Experiments show that, compared with other strategies, our approach generally improves the efficiency and correctness of statistics collection and reduces the negative effect on system performance.

Acknowledgements

This project was supported by the Key Research and Development Program (2018YFB1003403), the National Natural Science Foundation of China (Grant Nos. 61732014, 61672432, 61672434), and the Natural Science Basic Research Plan in Shaanxi Province of China (2017JM6104).

Author information

Authors and Affiliations

School of Computer Science, Northwestern Polytechnical University, Xi’an, 710072, China
Jintao Gao, Wenjie Liu & Zhanhuai Li

Corresponding author

Correspondence to Jintao Gao.

Additional information

Jintao Gao received BS and MS degrees from the School of Computer Science and Technology, Shandong Jianzhu University, China, in 2009 and 2012, respectively. He is currently a PhD student at the Department of Computer Software and Theories, School of Computer, Northwestern Polytechnical University, China. His research interests include query optimization in distributed databases and massive data management.

Wenjie Liu is an associate professor at the Department of Computer Software and Theories, School of Computer, Northwestern Polytechnical University, China. Her research interests include cloud computing, distributed databases, and massive data management.

Zhanhuai Li is a professor at the Department of Computer Software and Theories, School of Computer, Northwestern Polytechnical University, China. He is a doctoral supervisor, a CCF fellow, and a Database Committee fellow of China. His research interests include stream data management, data mining, massive data management, and cloud data storage.
