EntropyDB: a probabilistic approach to approximate query processing

Orr, Laurel; Balazinska, Magdalena; Suciu, Dan

doi:10.1007/s00778-019-00582-9

EntropyDB: a probabilistic approach to approximate query processing

Special Issue Paper
Published: 02 November 2019

Volume 29, pages 539–567, (2020)
Cite this article

The VLDB Journal Aims and scope Submit manuscript

844 Accesses
2 Citations
Explore all metrics

Abstract

We present, an interactive data exploration system that uses a probabilistic approach to generate a small, query-able summary of a dataset. Departing from traditional summarization techniques, we use the Principle of Maximum Entropy to generate a probabilistic representation of the data that can be used to give approximate query answers. We develop the theoretical framework and formulation of our probabilistic representation and show how to use it to answer queries. We then present solving techniques, give two critical optimizations to improve preprocessing time and query execution time, and explore methods to reduce query error. Lastly, we experimentally evaluate our work using a 5 GB dataset of flights within the USA and a 210 GB dataset from an astronomy particle simulation. While our current work only supports linear queries, we show that our technique can successfully answer queries faster than sampling while introducing, on average, no more error than sampling and can better distinguish between rare and nonexistent values. We also discuss extensions that can allow for data updates and linear queries over joins.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Fig. 9

A new approximate query engine based on intelligent capture and fast transformations of granulated data summaries

Article Open access 05 July 2017

Dominik Ślęzak, Rick Glick, … Piotr Synak

Data Summarization Using Sampling Algorithms: Data Stream Case Study

Comprehensive and Efficient Workload Summarization

Article 17 November 2022

Shaleen Deep, Anja Gruenheid, … Jeffrey Naughton

Notes

Using the MaxEnt principle will generate a probability distribution that is different from the tuple-independent distribution because the MaxEnt principle does not guarantee tuple independence.
We support continuous data types by bucketizing their active domains.
This is a standard data model in several applications, such as differential privacy [31].
We abuse notation here for readability. Technically, \(\alpha _{i} = \alpha _{a_i}\), \(\beta _{i} = \alpha _{b_i}\), and \(\gamma _{i} = \alpha _{c_i}\).
\(\mathcal {P}([k])\) is the power set of \(\{1, 2, \ldots , k\}\).
This can be found by calculating, for all attribute pairs, the Chi-squared value on the contingency table of \(A_{i_1}\) and \(A_{i_2}\) and sorting from highest to lowest Chi-squared value.
The maximum amount of memory used in experiments was approximately 40 GB, meaning a system this large is not required.
Pair 3 is the same for FlightsCoarse and FlightsFine

References

Acharya, S., Gibbons, P.B., Poosala, V.: Congressional samples for approximate answering of group-by queries. In: ACM Sigmod Record, vol. 29, pp. 487–498. ACM (2000)
Acharya, S., Gibbons, P.B., Poosala, V., Ramaswamy, S.: The aqua approximate query answering system. In: ACM Sigmod Record, vol. 28, pp. 574–576. ACM (1999)
Agarwal, S., et al.: Blinkdb: queries with bounded errors and bounded response times on very large data. In: Proceedings of EuroSys’13, pp. 29–42 (2013)
Applegate, D.A., Calinescu, G., Johnson, D.S., Karloff, H., Ligett, K., Wang, J.: Compressing rectilinear pictures and minimizing access control lists. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms
Babcock, B., Chaudhuri, S., Das, G.: Dynamic sample selection for approximate query processing. In: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pp. 539–550 (2003)
Bar-Yossef, Z., Jayram, T., Kumar, R., Sivakumar, D., Trevisan, L.: Counting distinct elements in a data stream. In: International Workshop on Randomization and Approximation Techniques in Computer Science, pp. 1–10. Springer (2002)
Behrisch, M., Bach, B., Henry Riche, N., Schreck, T., Fekete, J.-D.: Matrix reordering methods for table and network visualization. In: Computer Graphics Forum, vol. 35, pp. 693–716. Wiley Online Library (2016)
Bekker, J., Davis, J., Choi, A., Darwiche, A., Van den Broeck, G.: Tractable learning for complex probability queries. In: Advances in Neural Information Processing Systems, pp. 2242–2250 (2015)
Berger, A.L., Pietra, V.J.D., Pietra, S.A.D.: A maximum entropy approach to natural language processing. Comput. Linguist. 22(1), 39–71 (1996)
Google Scholar
Bubeck, S.: Convex optimization: algorithms and complexity. Found. Trends Mach. Learn. 8(3–4), 231–357 (2015)
Article Google Scholar
Chakrabarti, K., Garofalakis, M., Rastogi, R., Shim, K.: Approximate query processing using wavelets. VLDB J. Int. J. Very Large Data Bases 10(2–3), 199–223 (2001)
Article Google Scholar
Chaudhuri, S., Das, G., Narasayya, V.: A robust, optimization-based approach for approximate answering of aggregate queries. ACM SIGMOD Rec. 30, 295–306 (2001)
Article Google Scholar
Chaudhuri, S., Ding, B., Kandula, S.: Approximate query processing: No silver bullet. In: Proceedings of the 2017 ACM International Conference on Management of Data, pp. 511–519. ACM (2017)
Chow, C., Liu, C.: Approximating discrete probability distributions with dependence trees. IEEE Trans. Inf. Theory 14(3), 462–467 (1968)
Article MathSciNet Google Scholar
Cormode, G., Garofalakis, M., Haas, P.J., Jermaine, C., et al.: Synopses for massive data: samples, histograms, wavelets, sketches. Found. Trends® Databases 4(1–3), 1–294 (2011)
MATH Google Scholar
Crotty, A., Galakatos, A., Zgraggen, A., Binnig, C., Kraska, T.: The case for interactive data exploration accelerators (ideas). In: Proceedings of the Workshop on Human-In-the-Loop Data Analytics, p. 11. ACM (2016)
Dalvi, N., Ré, C., Suciu, D.: Probabilistic databases: diamonds in the dirt. Commun. ACM 52(7), 86–94 (2009)
Article Google Scholar
Deshpande, A., Garofalakis, M.N., Rastogi, R.: Independence is good: dependency-based histogram synopses for high-dimensional data. In: SIGMOD Conference (2001)
Ding, B., Huang, S., Chaudhuri, S., Chakrabarti, K., Wang, C.: Sample+seek: approximating aggregates with distribution precision guarantee. In: Proceedings of SIGMOD, pp. 679–694 (2016)
Dwork, C., Feldman, V., Hardt, M., Pitassi, T., Reingold, O., Roth, A.: Generalization in adaptive data analysis and holdout reuse. In: Advances in Neural Information Processing Systems, pp. 2350–2358 (2015)
Galakatos, A., Crotty, A., Zgraggen, E., Binnig, C., Kraska, T.: Revisiting reuse for approximate query processing. Proc. VLDB Endow. 10(10), 1142–1153 (2017)
Article Google Scholar
Hardt, M., Rothblum, G.N.: A multiplicative weights mechanism for privacy-preserving data analysis. In: 2010 51st Annual IEEE Symposium on, Foundations of Computer Science (FOCS), pp. 61–70. IEEE (2010)
Hellerstein, J.M., Haas, P.J., Wang, H.J.: Online aggregation. In: ACM SIGMOD Record, vol. 26, pp. 171–182. ACM (1997)
Hosangadi, A., Fallah, F., Kastner, R.: Factoring and eliminating common subexpressions in polynomial expressions. In: IEEE/ACM International Conference on Computer Aided Design, 2004. ICCAD-2004 (2004)
Jermaine, C., Arumugam, S., Pol, A., Dobra, A.: Scalable approximate query processing with the DBO engine. ACM Trans. Database Syst. (TODS) 33(4), 23 (2008)
Article Google Scholar
http://www.transtats.bts.gov/
Jetley, P. et al.: Massively parallel cosmological simulations with ChaNGa. In: Proceedings of IPDPS (2008)
Jordan, M.: An introduction to probabilistic graphical models (2003). http://www.cs.cmu.edu/~lebanon/pub/book/. Accessed 10 Nov 2018
Kandula, S., Shanbhag, A., Vitorovic, A., Olma, M., Grandl, R., Chaudhuri, S., Ding, B.: Quickr: lazily approximating complex adhoc queries in bigdata clusters. In: Proceedings of the 2016 International Conference on Management of Data, pp. 631–646. ACM (2016)
Kipf, A., Kipf, T., Radke, B., Leis, V., Boncz, P., Kemper, A.: Learned cardinalities: estimating correlated joins with deep learning (2018). arXiv preprint arXiv:1809.00677
Li, C., et al.: Optimizing linear counting queries under differential privacy. In: Proceedings of PODS, pp. 123–134 (2010)
Li, K., Li, G.: Approximate query processing: what is new and where to go? Data Sci. Eng. 3(4), 379–397 (2018)
Article Google Scholar
Li, K., Zhang, Y., Li, G., Tao, W., Yan, Y.: Bounded approximate query processing. IEEE Trans. Knowl. Data Eng. (2018). https://doi.org/10.1109/TKDE.2018.2877362
Article Google Scholar
Mäkinen, E., Siirtola, H.: Reordering the reorderable matrix as an algorithmic problem. In: International Conference on Theory and Application of Diagrams, pp. 453–468. Springer (2000)
Markl, V., et al.: Consistently estimating the selectivity of conjuncts of predicates. In: Proceedings of VLDB, pp. 373–384. VLDB Endowment (2005)
Mozafari, B., Niu, N.: A handbook for building an approximate query engine. IEEE Data Eng. Bull. 38(3), 3–29 (2015)
Google Scholar
Murphy, K.: Undirected graphical models (2006). https://www.cs.ubc.ca/~murphyk/Teaching/CS340-Fall06/reading/ugm.pdf. Accessed 19 Nov 2018
Orr, L., Balazinska, M., Suciu, D.: Probabilistic database summarization for interactive data exploration. Proc. VLDB Endow. 10(10), 1154–1165 (2017)
Article Google Scholar
Ortiz, J., Balazinska, M., Gehrke, J., Keerthi, S.S.: Learning state representations for query optimization with deep reinforcement learning (2018). arXiv preprint arXiv:1803.08604
Park, Y., Mozafari, B., Sorenson, J., Wang, J.: Verdictdb: universalizing approximate query processing. In: Proceedings of the 2018 International Conference on Management of Data, pp. 1461–1476. ACM (2018)
Peng, J., Zhang, D., Wang, J., Pei, J.: Aqp++: connecting approximate query processing with aggregate precomputation for interactive analytics. In: Proceedings of the 2018 International Conference on Management of Data, pp. 1477–1492. ACM (2018)
Ré, C., Suciu, D.: Understanding cardinality estimation using entropy maximization. ACM TODS 37(1), 6 (2012)
Article Google Scholar
Suciu, D., Olteanu, D., Ré, C., Koch, C.: Probabilistic databases. Synth. Lect. Data Manag. 3(2), 1–180 (2011)
Article Google Scholar
Teh, Y.W., Welling, M.: On improving the efficiency of the iterative proportional fitting procedure. In: AIStats (2003)
Thirumuruganathan, S., Hasan, S., Koudas, N., Das, G.: Approximate query processing using deep generative models (2019). arXiv preprint arXiv:1903.10000
Tzoumas, K., Deshpande, A., Jensen, C.S.: Efficiently adapting graphical models for selectivity estimation. VLDB J. 22(1), 3–27 (2013)
Article Google Scholar
Wainwright, M.J., Jordan, M.I.: Graphical models, exponential families, and variational inference. Found. Trends Mach. Learn. 1(1–2), 1–305 (2008)
Article Google Scholar
Wu, M., Jermaine, C.: A Bayesian method for guessing the extreme values in a data set? In: Proceedings of the 33rd International Conference on Very Large Data Bases, pp. 471–482. VLDB Endowment (2007)
Yang, E., Ravikumar, P., Allen, G.I., Liu, Z.: Graphical models via univariate exponential family distributions. J. Mach. Learn. Res. 16(1), 3813–3847 (2015)
MathSciNet MATH Google Scholar

Download references

Acknowledgements

This work is supported by NSF 1614738 and NSF 1535565. Laurel Orr is supported by the NSF Graduate Research Fellowship.

Author information

Authors and Affiliations

University of Washington, Seattle, WA, USA
Laurel Orr, Magdalena Balazinska & Dan Suciu

Authors

Laurel Orr
View author publications
You can also search for this author in PubMed Google Scholar
Magdalena Balazinska
View author publications
You can also search for this author in PubMed Google Scholar
Dan Suciu
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Laurel Orr.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This is an extended version of the VLDB 2017 paper “Probabilistic Database Summarization for Interactive Data Exploration” [38].

Rights and permissions

Reprints and permissions

About this article

Cite this article

Orr, L., Balazinska, M. & Suciu, D. EntropyDB: a probabilistic approach to approximate query processing. The VLDB Journal 29, 539–567 (2020). https://doi.org/10.1007/s00778-019-00582-9

Download citation

Received: 01 December 2018
Revised: 25 April 2019
Accepted: 15 October 2019
Published: 02 November 2019
Issue Date: January 2020
DOI: https://doi.org/10.1007/s00778-019-00582-9

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

EntropyDB: a probabilistic approach to approximate query processing

Abstract

Access this article

Similar content being viewed by others

A new approximate query engine based on intelligent capture and fast transformations of granulated data summaries

Data Summarization Using Sampling Algorithms: Data Stream Case Study

Comprehensive and Efficient Workload Summarization

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Keywords

Navigation

EntropyDB: a probabilistic approach to approximate query processing

Abstract

Access this article

Similar content being viewed by others

A new approximate query engine based on intelligent capture and fast transformations of granulated data summaries

Data Summarization Using Sampling Algorithms: Data Stream Case Study

Comprehensive and Efficient Workload Summarization

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation