Abstract
Histograms are widely used in selectivity estimation for one-dimensional data. Using the one-dimensional histograms to estimate the selectivity of the multidimensional queries will result in a high estimation error, unless the assumption of attribute independence is true. Constructing a multidimensional histogram also brings great challenges. The storage of a multidimensional histogram exponentially increases with the number of dimensions. In this paper, we propose a density-model-based multidimensional histogram. It uses a lightweight density model to predict the densities of a large number of regions instead of storing too many buckets. The experimental results indicate that our method can provide highly accurate selectivity estimations while occupying little space. In addition, the superiority of our method is more evident in high-dimensional data.
Similar content being viewed by others
References
Aboulnaga A, Chaudhuri S (1999) Self-tuning histograms: building histograms without looking at data. In: SIGMOD 1999, proceedings ACM SIGMOD international conference on management of data, 1–3 June 1999, Philadelphia, Pennsylvania, USA, pp 181–192
Acharya S, Poosala V, Ramaswamy S (1999) Selectivity estimation in spatial databases. In: SIGMOD 1999, proceedings ACM SIGMOD international conference on management of data, 1–3 June 1999, Philadelphia, Pennsylvania, USA, pp 13–24
Breiman L (2001) Random forests. Mach Learn 45(1):5–32
Bruno N, Chaudhuri S, Gravano L (2001) Stholes: a multidimensional workload-aware histogram. In: Proceedings of the 2001 ACM SIGMOD international conference on Management of data, Santa Barbara, CA, USA, 21–24 May 2001, pp 211–222
Chaudhuri S, Narasayya VR (2007) Self-tuning database systems: a decade of progress. In: Koch C, Gehrke J, Garofalakis MN, Srivastava D, Aberer K, Deshpande A, Florescu D, Chan CY, Ganti V, Kanne C-C, Klas W, Neuhold EJ (eds) Proceedings of the 33rd international conference on very large data bases, University of Vienna, Austria, 23–27 Sept 2007. ACM, pp 3–14
Cormode G, Garofalakis MN, Haas PJ, Jermaine C (2012) Synopses for massive data: samples, histograms, wavelets, sketches. Found Trends Databases 4(1–3):1–294
Dutt A, Wang C, Nazi A, Kandula S, Narasayya VR, Chaudhuri S (2019) Selectivity estimation for range predicates using lightweight models. Proc VLDB Endow 12(9):1044–1057
Gao B, Liu N, Wang X, Lan M, Zhao Z, Dellandréa E, Chen L (2018) A method to accelerate k-means and GMM computation with GPU and multi-core CPU. In: Fourth IEEE international conference on multimedia big data, BigMM 2018, Xi’an, China, 13–16 Sept 2018. IEEE, pp 1–5
Guha S, Koudas N, Shim K (2006) Approximation and streaming algorithms for histogram construction problems. ACM Trans Database Syst 31(1):396–438
Gunopulos D, Kollios G, Tsotras VJ, Domeniconi C (2000) Approximating multi-dimensional aggregate range queries over real attributes. In: Proceedings of the 2000 ACM SIGMOD international conference on management of data, 16–18 May 2000, Dallas, Texas, USA, pp 463–474
Gunopulos D, Kollios G, Tsotras VJ, Domeniconi C (2005) Selectivity estimators for multidimensional range queries over real attributes. VLDB J 14(2):137–154
Hasan S, Thirumuruganathan S, Augustine J, Koudas N, Das G (2020) Deep learning models for selectivity estimation of multi-attribute queries. In: Maier D, Pottinger R, Doan A, Tan W-C, Alawini A, Ngo HQ (eds) Proceedings of the 2020 international conference on management of data, SIGMOD conference 2020, online conference [Portland, OR, USA], 14–19 June 2020. ACM, pp 1035–1050
Heimel M, Kiefer M, Markl V (2015) Self-tuning, GPU-accelerated kernel density models for multidimensional selectivity estimation. In: Proceedings of the 2015 ACM SIGMOD international conference on management of data, Melbourne, Victoria, Australia, May 31–June 4, 2015, pp 1477–1492
Hilprecht B, Schmidt A, Kulessa M, Molina A, Kersting K, Binnig C (2020) Deepdb: Learn from data, not from queries!. Proc VLDB Endow 13(7):992–1005
Ioannidis YE (2003) The history of histograms (abridged). In: VLDB 2003, proceedings of 29th international conference on very large data bases, 9–12 Sept 2003, Berlin, Germany, pp 19–30
Ioannidis YE, Poosala V (1995) Balancing histogram optimality and practicality for query result size estimation. In: Proceedings of the 1995 ACM SIGMOD international conference on management of data, San Jose, California, 22–25 May 1995, pp 233–244
Kaushik R, Suciu D (2009) Consistent histograms in the presence of distinct value counts. Proc VLDB Endow 2(1):850–861
Khachatryan A, Müller E, Böhm K, Stier C (2016) Improving accuracy and robustness of self-tuning histograms by subspace clustering. In: 32nd IEEE international conference on data engineering, ICDE 2016, Helsinki, Finland, 16–20 May 2016, pp 1544–1545
Kipf A, Kipf T, Radke B, Leis V, Boncz PA, Kemper A (2019) Learned cardinalities: estimating correlated joins with deep learning. In: CIDR 2019, 9th Biennial conference on innovative data systems research, Asilomar, CA, USA, 13–16 Jan 2019, online proceedings. www.cidrdb.org
Kooi R (September 1980) The optimization of queries in relational databases. Ph.D. thesis, Case Western Reserve University
Low JS, Ghafoori Z, Bezdek JC, Leckie C (2019) Seeding on samples for accelerating k-means clustering. In: Proceedings of the 3rd international conference on big data and internet of things, BDIOT 2019, La Trobe University, Melbourne, VIC, Australia, 22–24 Aug 2019. ACM, pp 41–45
Matias Y, Vitter JS, Wang M (2000) Dynamic maintenance of wavelet-based histograms. In: VLDB 2000, proceedings of 26th international conference on very large data bases, 10–14 Sept 2000, Cairo, Egypt, pp 101–110
Moerkotte G, Neumann T, Steidl G (2009) Preventing bad plans by bounding the impact of cardinality estimation errors. Proc VLDB Endow 2(1):982–993
Müller M, Moerkotte G, Kolb O (2018) Improved selectivity estimation by combining knowledge from sampling and synopses. PVLDB 11(9):1016–1028
Muralikrishna M, DeWitt DJ (1988) Equi-depth histograms for estimating selectivity factors for multi-dimensional queries. In: Proceedings of the 1988 ACM SIGMOD international conference on management of data, Chicago, Illinois, USA, 1–3 June 1988, pp 28–36
Muthukrishnan S, Poosala V, Suel T (1999) On rectangular partitionings in two dimensions: algorithms, complexity, and applications. In: Database theory—ICDT ’99, 7th international conference, Jerusalem, Israel, 10–12 Jan 1999, Proceedings, pp 236–256
Park Y, Zhong S, Mozafari B (2020) Quicksel: quick selectivity learning with mixture models. In: Maier D, Pottinger R, Doan A, Tan W-C, Alawini A, Ngo HQ (eds) Proceedings of the 2020 international conference on management of data, SIGMOD Conference 2020, online conference [Portland, OR, USA], 14–19 June 2020. ACM, pp 1017–1033
Piatetsky-Shapiro G, Connell C (1984) Accurate estimation of the number of tuples satisfying a condition. In: SIGMOD’84, proceedings of annual meeting, Boston, Massachusetts, USA, 18–21 June 1984, pp 256–276
Poosala V, Ioannidis YE (1996) Estimation of query-result distribution and its application in parallel-join load balancing. In: VLDB’96, proceedings of 22th international conference on very large data bases, 3–6 Sept 1996, Mumbai (Bombay), India, pp 448–459
Poosala V, Ioannidis YE (1997) Selectivity estimation without the attribute value independence assumption. In: VLDB’97, proceedings of 23rd international conference on very large data bases, 25–29 Aug 1997, Athens, Greece, pp 486–495
Reuther A, Michaleas P, Jones M, Gadepally V, Samsi S, Kepner J (2019) Survey and benchmarking of machine learning accelerators. In: 2019 IEEE high performance extreme computing conference, HPEC 2019, Waltham, MA, USA, 24–26 Sept 2019. IEEE, pp 1–9
Sculley D (2010) Web-scale k-means clustering. In: Proceedings of the 19th international conference on world wide web, WWW 2010, Raleigh, North Carolina, USA, 26–30 April 2010, pp 1177–1178
Shekelyan M, Dignös A, Gamper J (2017) Digithist: a histogram-based data summary with tight error bounds. PVLDB 10(11):1514–1525
Wu Y-L, Agrawal D, El Abbadi A (2002) Query estimation by adaptive sampling. In: Proceedings of the 18th international conference on data engineering, San Jose, CA, USA, February 26–March 1, 2002, pp 639–648
Yang Z, Liang E, Kamsetty A, Chenggang W, Duan Y, Chen P, Abbeel P, Hellerstein JM, Krishnan S, Stoica I (2019) Deep unsupervised cardinality estimation. Proc VLDB Endow 13(3):279–292
Yildiz B, Büyüktanir T, Emekçi F (2016) Equi-depth histogram construction for big data with quality guarantees. CoRR, arXiv:1606.05633
Acknowledgements
This paper was supported by NSFC Grant U1866602 and CCF-Huawei Database System Innovation Research Plan CCF-HuaweiDBIR2020007B.
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
About this article
Cite this article
Zhang, M., Wang, H. Selectivity estimation with density-model-based multidimensional histogram. Knowl Inf Syst 63, 971–992 (2021). https://doi.org/10.1007/s10115-021-01547-7
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10115-021-01547-7