EntropyDB: a probabilistic approach to approximate query processing

  • Special Issue Paper
  • The VLDB Journal

Abstract

We present EntropyDB, an interactive data exploration system that uses a probabilistic approach to generate a small, queryable summary of a dataset. Departing from traditional summarization techniques, we use the Principle of Maximum Entropy to generate a probabilistic representation of the data that can be used to give approximate query answers. We develop the theoretical framework and formulation of our probabilistic representation and show how to use it to answer queries. We then present solving techniques, give two critical optimizations to improve preprocessing time and query execution time, and explore methods to reduce query error. Lastly, we experimentally evaluate our work using a 5 GB dataset of flights within the USA and a 210 GB dataset from an astronomy particle simulation. While our current work only supports linear queries, we show that our technique can answer queries faster than sampling while introducing, on average, no more error than sampling, and that it can better distinguish between rare and nonexistent values. We also discuss extensions that can allow for data updates and linear queries over joins.
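As a rough illustration of the idea in the abstract, the sketch below fits a maximum-entropy distribution to a handful of statistics and uses it to answer a linear counting query. It is not EntropyDB's solver: the toy domain, the statistics, and the names `fit_maxent` and `expected_count` are assumptions made for this example, and the fitting method is plain iterative proportional fitting over an explicitly enumerated domain, which only works for very small domains.

```python
from itertools import product


def fit_maxent(dom_a, dom_b, marg_a, marg_b, cell_stats, iters=500):
    """Fit a MaxEnt distribution over dom_a x dom_b by iterative proportional fitting.

    marg_a / marg_b map each value to its 1D frequency (assumed > 0).
    cell_stats is a list of (set_of_tuples, frequency) pairs, 0 < frequency < 1.
    """
    tuples = list(product(dom_a, dom_b))
    p = {t: 1.0 / len(tuples) for t in tuples}  # uniform start: maximal entropy
    for _ in range(iters):
        for a in dom_a:  # rescale so that P(A = a) matches its statistic
            cur = sum(p[(a, b)] for b in dom_b)
            for b in dom_b:
                p[(a, b)] *= marg_a[a] / cur
        for b in dom_b:  # rescale so that P(B = b) matches its statistic
            cur = sum(p[(a, b)] for a in dom_a)
            for a in dom_a:
                p[(a, b)] *= marg_b[b] / cur
        for cells, target in cell_stats:  # rescale to match each 2D statistic
            cur = sum(p[t] for t in cells)
            inside = target / cur
            outside = (1.0 - target) / (1.0 - cur)
            for t in tuples:
                p[t] *= inside if t in cells else outside
    return p


def expected_count(p, n, predicate):
    """Approximate a linear counting query as n * P(predicate holds)."""
    return n * sum(prob for t, prob in p.items() if predicate(t))


# Hypothetical summary of a relation with n = 1000 tuples and attributes A, B.
n = 1000
model = fit_maxent(
    dom_a=["x", "y"],
    dom_b=["u", "v", "w"],
    marg_a={"x": 0.6, "y": 0.4},
    marg_b={"u": 0.5, "v": 0.3, "w": 0.2},
    cell_stats=[({("x", "u")}, 0.4)],
)
estimate = expected_count(model, n, lambda t: t == ("x", "v"))
```

For these toy statistics the MaxEnt solution can be worked out by hand: outside the cell covered by the 2D statistic the distribution factorizes, giving P(A = x, B = v) = 0.12, so `estimate` should come out close to 120.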

Notes

  1. Using the MaxEnt principle will generate a probability distribution that is different from the tuple-independent distribution because the MaxEnt principle does not guarantee tuple independence.

  2. We support continuous data types by bucketizing their active domains.

  3. This is a standard data model in several applications, such as differential privacy [31].

  4. We abuse notation here for readability. Technically, \(\alpha _{i} = \alpha _{a_i}\), \(\beta _{i} = \alpha _{b_i}\), and \(\gamma _{i} = \alpha _{c_i}\).

  5. \(\mathcal {P}([k])\) is the power set of \(\{1, 2, \ldots , k\}\).

  6. This can be found by calculating, for all attribute pairs, the Chi-squared value on the contingency table of \(A_{i_1}\) and \(A_{i_2}\) and sorting from highest to lowest Chi-squared value.

  7. The maximum amount of memory used in experiments was approximately 40 GB, meaning a system this large is not required.

  8. Pair 3 is the same for FlightsCoarse and FlightsFine.
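The ranking described in note 6 can be sketched as follows. This is a minimal illustration, assuming the data fits in memory as a list of tuples of discrete (or already bucketized) values; the function names `chi_squared` and `rank_pairs` and the toy rows are hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import combinations


def chi_squared(rows, i, j):
    """Pearson chi-squared statistic on the contingency table of attributes i and j."""
    n = len(rows)
    joint = Counter((r[i], r[j]) for r in rows)
    left = Counter(r[i] for r in rows)
    right = Counter(r[j] for r in rows)
    stat = 0.0
    for a in left:
        for b in right:
            expected = left[a] * right[b] / n  # count expected under independence
            observed = joint.get((a, b), 0)
            stat += (observed - expected) ** 2 / expected
    return stat


def rank_pairs(rows, num_attrs):
    """Return attribute pairs sorted from highest to lowest chi-squared value."""
    scored = [
        (chi_squared(rows, i, j), (i, j))
        for i, j in combinations(range(num_attrs), 2)
    ]
    return [pair for _, pair in sorted(scored, reverse=True)]


# Toy data: attributes 0 and 1 are perfectly correlated, attribute 2 is independent.
rows = [
    ("a", "p", "s"),
    ("a", "p", "t"),
    ("b", "q", "s"),
    ("b", "q", "t"),
]
ranking = rank_pairs(rows, 3)
```

On this toy input the correlated pair `(0, 1)` is ranked first, since its observed counts deviate most from what independence would predict.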

References

  1. Acharya, S., Gibbons, P.B., Poosala, V.: Congressional samples for approximate answering of group-by queries. In: ACM Sigmod Record, vol. 29, pp. 487–498. ACM (2000)

  2. Acharya, S., Gibbons, P.B., Poosala, V., Ramaswamy, S.: The aqua approximate query answering system. In: ACM Sigmod Record, vol. 28, pp. 574–576. ACM (1999)

  3. Agarwal, S., et al.: Blinkdb: queries with bounded errors and bounded response times on very large data. In: Proceedings of EuroSys’13, pp. 29–42 (2013)

  4. Applegate, D.A., Calinescu, G., Johnson, D.S., Karloff, H., Ligett, K., Wang, J.: Compressing rectilinear pictures and minimizing access control lists. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms

  5. Babcock, B., Chaudhuri, S., Das, G.: Dynamic sample selection for approximate query processing. In: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pp. 539–550 (2003)

  6. Bar-Yossef, Z., Jayram, T., Kumar, R., Sivakumar, D., Trevisan, L.: Counting distinct elements in a data stream. In: International Workshop on Randomization and Approximation Techniques in Computer Science, pp. 1–10. Springer (2002)

  7. Behrisch, M., Bach, B., Henry Riche, N., Schreck, T., Fekete, J.-D.: Matrix reordering methods for table and network visualization. In: Computer Graphics Forum, vol. 35, pp. 693–716. Wiley Online Library (2016)

  8. Bekker, J., Davis, J., Choi, A., Darwiche, A., Van den Broeck, G.: Tractable learning for complex probability queries. In: Advances in Neural Information Processing Systems, pp. 2242–2250 (2015)

  9. Berger, A.L., Pietra, V.J.D., Pietra, S.A.D.: A maximum entropy approach to natural language processing. Comput. Linguist. 22(1), 39–71 (1996)

  10. Bubeck, S.: Convex optimization: algorithms and complexity. Found. Trends Mach. Learn. 8(3–4), 231–357 (2015)

  11. Chakrabarti, K., Garofalakis, M., Rastogi, R., Shim, K.: Approximate query processing using wavelets. VLDB J. Int. J. Very Large Data Bases 10(2–3), 199–223 (2001)

  12. Chaudhuri, S., Das, G., Narasayya, V.: A robust, optimization-based approach for approximate answering of aggregate queries. ACM SIGMOD Rec. 30, 295–306 (2001)

  13. Chaudhuri, S., Ding, B., Kandula, S.: Approximate query processing: No silver bullet. In: Proceedings of the 2017 ACM International Conference on Management of Data, pp. 511–519. ACM (2017)

  14. Chow, C., Liu, C.: Approximating discrete probability distributions with dependence trees. IEEE Trans. Inf. Theory 14(3), 462–467 (1968)

  15. Cormode, G., Garofalakis, M., Haas, P.J., Jermaine, C., et al.: Synopses for massive data: samples, histograms, wavelets, sketches. Found. Trends® Databases 4(1–3), 1–294 (2011)

  16. Crotty, A., Galakatos, A., Zgraggen, E., Binnig, C., Kraska, T.: The case for interactive data exploration accelerators (ideas). In: Proceedings of the Workshop on Human-In-the-Loop Data Analytics, p. 11. ACM (2016)

  17. Dalvi, N., Ré, C., Suciu, D.: Probabilistic databases: diamonds in the dirt. Commun. ACM 52(7), 86–94 (2009)

  18. Deshpande, A., Garofalakis, M.N., Rastogi, R.: Independence is good: dependency-based histogram synopses for high-dimensional data. In: SIGMOD Conference (2001)

  19. Ding, B., Huang, S., Chaudhuri, S., Chakrabarti, K., Wang, C.: Sample+seek: approximating aggregates with distribution precision guarantee. In: Proceedings of SIGMOD, pp. 679–694 (2016)

  20. Dwork, C., Feldman, V., Hardt, M., Pitassi, T., Reingold, O., Roth, A.: Generalization in adaptive data analysis and holdout reuse. In: Advances in Neural Information Processing Systems, pp. 2350–2358 (2015)

  21. Galakatos, A., Crotty, A., Zgraggen, E., Binnig, C., Kraska, T.: Revisiting reuse for approximate query processing. Proc. VLDB Endow. 10(10), 1142–1153 (2017)

  22. Hardt, M., Rothblum, G.N.: A multiplicative weights mechanism for privacy-preserving data analysis. In: 2010 51st Annual IEEE Symposium on Foundations of Computer Science (FOCS), pp. 61–70. IEEE (2010)

  23. Hellerstein, J.M., Haas, P.J., Wang, H.J.: Online aggregation. In: ACM SIGMOD Record, vol. 26, pp. 171–182. ACM (1997)

  24. Hosangadi, A., Fallah, F., Kastner, R.: Factoring and eliminating common subexpressions in polynomial expressions. In: IEEE/ACM International Conference on Computer Aided Design, 2004. ICCAD-2004 (2004)

  25. Jermaine, C., Arumugam, S., Pol, A., Dobra, A.: Scalable approximate query processing with the DBO engine. ACM Trans. Database Syst. (TODS) 33(4), 23 (2008)

  26. http://www.transtats.bts.gov/

  27. Jetley, P. et al.: Massively parallel cosmological simulations with ChaNGa. In: Proceedings of IPDPS (2008)

  28. Jordan, M.: An introduction to probabilistic graphical models (2003). http://www.cs.cmu.edu/~lebanon/pub/book/. Accessed 10 Nov 2018

  29. Kandula, S., Shanbhag, A., Vitorovic, A., Olma, M., Grandl, R., Chaudhuri, S., Ding, B.: Quickr: lazily approximating complex adhoc queries in bigdata clusters. In: Proceedings of the 2016 International Conference on Management of Data, pp. 631–646. ACM (2016)

  30. Kipf, A., Kipf, T., Radke, B., Leis, V., Boncz, P., Kemper, A.: Learned cardinalities: estimating correlated joins with deep learning (2018). arXiv preprint arXiv:1809.00677

  31. Li, C., et al.: Optimizing linear counting queries under differential privacy. In: Proceedings of PODS, pp. 123–134 (2010)

  32. Li, K., Li, G.: Approximate query processing: what is new and where to go? Data Sci. Eng. 3(4), 379–397 (2018)

  33. Li, K., Zhang, Y., Li, G., Tao, W., Yan, Y.: Bounded approximate query processing. IEEE Trans. Knowl. Data Eng. (2018). https://doi.org/10.1109/TKDE.2018.2877362

  34. Mäkinen, E., Siirtola, H.: Reordering the reorderable matrix as an algorithmic problem. In: International Conference on Theory and Application of Diagrams, pp. 453–468. Springer (2000)

  35. Markl, V., et al.: Consistently estimating the selectivity of conjuncts of predicates. In: Proceedings of VLDB, pp. 373–384. VLDB Endowment (2005)

  36. Mozafari, B., Niu, N.: A handbook for building an approximate query engine. IEEE Data Eng. Bull. 38(3), 3–29 (2015)

  37. Murphy, K.: Undirected graphical models (2006). https://www.cs.ubc.ca/~murphyk/Teaching/CS340-Fall06/reading/ugm.pdf. Accessed 19 Nov 2018

  38. Orr, L., Balazinska, M., Suciu, D.: Probabilistic database summarization for interactive data exploration. Proc. VLDB Endow. 10(10), 1154–1165 (2017)

  39. Ortiz, J., Balazinska, M., Gehrke, J., Keerthi, S.S.: Learning state representations for query optimization with deep reinforcement learning (2018). arXiv preprint arXiv:1803.08604

  40. Park, Y., Mozafari, B., Sorenson, J., Wang, J.: Verdictdb: universalizing approximate query processing. In: Proceedings of the 2018 International Conference on Management of Data, pp. 1461–1476. ACM (2018)

  41. Peng, J., Zhang, D., Wang, J., Pei, J.: Aqp++: connecting approximate query processing with aggregate precomputation for interactive analytics. In: Proceedings of the 2018 International Conference on Management of Data, pp. 1477–1492. ACM (2018)

  42. Ré, C., Suciu, D.: Understanding cardinality estimation using entropy maximization. ACM TODS 37(1), 6 (2012)

  43. Suciu, D., Olteanu, D., Ré, C., Koch, C.: Probabilistic databases. Synth. Lect. Data Manag. 3(2), 1–180 (2011)

  44. Teh, Y.W., Welling, M.: On improving the efficiency of the iterative proportional fitting procedure. In: AIStats (2003)

  45. Thirumuruganathan, S., Hasan, S., Koudas, N., Das, G.: Approximate query processing using deep generative models (2019). arXiv preprint arXiv:1903.10000

  46. Tzoumas, K., Deshpande, A., Jensen, C.S.: Efficiently adapting graphical models for selectivity estimation. VLDB J. 22(1), 3–27 (2013)

  47. Wainwright, M.J., Jordan, M.I.: Graphical models, exponential families, and variational inference. Found. Trends Mach. Learn. 1(1–2), 1–305 (2008)

  48. Wu, M., Jermaine, C.: A Bayesian method for guessing the extreme values in a data set? In: Proceedings of the 33rd International Conference on Very Large Data Bases, pp. 471–482. VLDB Endowment (2007)

  49. Yang, E., Ravikumar, P., Allen, G.I., Liu, Z.: Graphical models via univariate exponential family distributions. J. Mach. Learn. Res. 16(1), 3813–3847 (2015)

Acknowledgements

This work is supported by NSF 1614738 and NSF 1535565. Laurel Orr is supported by the NSF Graduate Research Fellowship.

Author information

Corresponding author

Correspondence to Laurel Orr.

Additional information

This is an extended version of the VLDB 2017 paper “Probabilistic Database Summarization for Interactive Data Exploration” [38].

About this article

Cite this article

Orr, L., Balazinska, M. & Suciu, D. EntropyDB: a probabilistic approach to approximate query processing. The VLDB Journal 29, 539–567 (2020). https://doi.org/10.1007/s00778-019-00582-9

  • DOI: https://doi.org/10.1007/s00778-019-00582-9
