Abstract
We present EntropyDB, an interactive data exploration system that uses a probabilistic approach to generate a small, queryable summary of a dataset. Departing from traditional summarization techniques, we use the Principle of Maximum Entropy to generate a probabilistic representation of the data that can be used to give approximate query answers. We develop the theoretical framework and formulation of our probabilistic representation and show how to use it to answer queries. We then present solving techniques, give two critical optimizations to improve preprocessing time and query execution time, and explore methods to reduce query error. Lastly, we experimentally evaluate our work using a 5 GB dataset of flights within the USA and a 210 GB dataset from an astronomy particle simulation. While our current work supports only linear queries, we show that our technique answers queries faster than sampling, introduces on average no more error than sampling, and better distinguishes between rare and nonexistent values. We also discuss extensions that can allow for data updates and linear queries over joins.
Notes
Using the MaxEnt principle will generate a probability distribution that is different from the tuple-independent distribution because the MaxEnt principle does not guarantee tuple independence.
We support continuous data types by bucketizing their active domains.
This is a standard data model in several applications, such as differential privacy [31].
We abuse notation here for readability. Technically, \(\alpha _{i} = \alpha _{a_i}\), \(\beta _{i} = \alpha _{b_i}\), and \(\gamma _{i} = \alpha _{c_i}\).
\(\mathcal {P}([k])\) is the power set of \(\{1, 2, \ldots , k\}\).
This can be found by calculating, for all attribute pairs, the Chi-squared value on the contingency table of \(A_{i_1}\) and \(A_{i_2}\) and sorting from highest to lowest Chi-squared value.
The maximum amount of memory used in experiments was approximately 40 GB, meaning a system this large is not required.
Pair 3 is the same for FlightsCoarse and FlightsFine.
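The Chi-squared ranking of attribute pairs described in the note above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes every attribute is categorical (or already bucketized), and the names `chi_squared`, `rank_attribute_pairs`, and the toy `rows` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def chi_squared(rows, i, j):
    """Pearson Chi-squared statistic on the contingency table of columns i and j."""
    n = len(rows)
    joint = Counter((r[i], r[j]) for r in rows)   # observed cell counts
    left = Counter(r[i] for r in rows)            # marginal counts of column i
    right = Counter(r[j] for r in rows)           # marginal counts of column j
    stat = 0.0
    for a, count_a in left.items():
        for b, count_b in right.items():
            # Expected count of cell (a, b) if the two columns were independent.
            expected = count_a * count_b / n
            observed = joint.get((a, b), 0)
            stat += (observed - expected) ** 2 / expected
    return stat


def rank_attribute_pairs(rows, num_attributes):
    """Score every attribute pair and sort from highest to lowest Chi-squared value."""
    scored = [
        ((i, j), chi_squared(rows, i, j))
        for i, j in combinations(range(num_attributes), 2)
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy data (illustrative only): columns 0 and 1 are perfectly correlated,
# column 2 is independent of both.
rows = [("a", "x", 0), ("a", "x", 1), ("b", "y", 0), ("b", "y", 1)]
ranking = rank_attribute_pairs(rows, 3)
```

On the toy data, the correlated pair `(0, 1)` is ranked first and the two pairs involving column 2 score zero. A larger Chi-squared value signals a stronger departure from independence, which is why the pairs are sorted in descending order.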
References
Acharya, S., Gibbons, P.B., Poosala, V.: Congressional samples for approximate answering of group-by queries. In: ACM Sigmod Record, vol. 29, pp. 487–498. ACM (2000)
Acharya, S., Gibbons, P.B., Poosala, V., Ramaswamy, S.: The aqua approximate query answering system. In: ACM Sigmod Record, vol. 28, pp. 574–576. ACM (1999)
Agarwal, S., et al.: Blinkdb: queries with bounded errors and bounded response times on very large data. In: Proceedings of EuroSys’13, pp. 29–42 (2013)
Applegate, D.A., Calinescu, G., Johnson, D.S., Karloff, H., Ligett, K., Wang, J.: Compressing rectilinear pictures and minimizing access control lists. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms
Babcock, B., Chaudhuri, S., Das, G.: Dynamic sample selection for approximate query processing. In: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pp. 539–550 (2003)
Bar-Yossef, Z., Jayram, T., Kumar, R., Sivakumar, D., Trevisan, L.: Counting distinct elements in a data stream. In: International Workshop on Randomization and Approximation Techniques in Computer Science, pp. 1–10. Springer (2002)
Behrisch, M., Bach, B., Henry Riche, N., Schreck, T., Fekete, J.-D.: Matrix reordering methods for table and network visualization. In: Computer Graphics Forum, vol. 35, pp. 693–716. Wiley Online Library (2016)
Bekker, J., Davis, J., Choi, A., Darwiche, A., Van den Broeck, G.: Tractable learning for complex probability queries. In: Advances in Neural Information Processing Systems, pp. 2242–2250 (2015)
Berger, A.L., Pietra, V.J.D., Pietra, S.A.D.: A maximum entropy approach to natural language processing. Comput. Linguist. 22(1), 39–71 (1996)
Bubeck, S.: Convex optimization: algorithms and complexity. Found. Trends Mach. Learn. 8(3–4), 231–357 (2015)
Chakrabarti, K., Garofalakis, M., Rastogi, R., Shim, K.: Approximate query processing using wavelets. VLDB J. Int. J. Very Large Data Bases 10(2–3), 199–223 (2001)
Chaudhuri, S., Das, G., Narasayya, V.: A robust, optimization-based approach for approximate answering of aggregate queries. ACM SIGMOD Rec. 30, 295–306 (2001)
Chaudhuri, S., Ding, B., Kandula, S.: Approximate query processing: No silver bullet. In: Proceedings of the 2017 ACM International Conference on Management of Data, pp. 511–519. ACM (2017)
Chow, C., Liu, C.: Approximating discrete probability distributions with dependence trees. IEEE Trans. Inf. Theory 14(3), 462–467 (1968)
Cormode, G., Garofalakis, M., Haas, P.J., Jermaine, C., et al.: Synopses for massive data: samples, histograms, wavelets, sketches. Found. Trends® Databases 4(1–3), 1–294 (2011)
Crotty, A., Galakatos, A., Zgraggen, A., Binnig, C., Kraska, T.: The case for interactive data exploration accelerators (ideas). In: Proceedings of the Workshop on Human-In-the-Loop Data Analytics, p. 11. ACM (2016)
Dalvi, N., Ré, C., Suciu, D.: Probabilistic databases: diamonds in the dirt. Commun. ACM 52(7), 86–94 (2009)
Deshpande, A., Garofalakis, M.N., Rastogi, R.: Independence is good: dependency-based histogram synopses for high-dimensional data. In: SIGMOD Conference (2001)
Ding, B., Huang, S., Chaudhuri, S., Chakrabarti, K., Wang, C.: Sample+seek: approximating aggregates with distribution precision guarantee. In: Proceedings of SIGMOD, pp. 679–694 (2016)
Dwork, C., Feldman, V., Hardt, M., Pitassi, T., Reingold, O., Roth, A.: Generalization in adaptive data analysis and holdout reuse. In: Advances in Neural Information Processing Systems, pp. 2350–2358 (2015)
Galakatos, A., Crotty, A., Zgraggen, E., Binnig, C., Kraska, T.: Revisiting reuse for approximate query processing. Proc. VLDB Endow. 10(10), 1142–1153 (2017)
Hardt, M., Rothblum, G.N.: A multiplicative weights mechanism for privacy-preserving data analysis. In: 2010 51st Annual IEEE Symposium on Foundations of Computer Science (FOCS), pp. 61–70. IEEE (2010)
Hellerstein, J.M., Haas, P.J., Wang, H.J.: Online aggregation. In: ACM SIGMOD Record, vol. 26, pp. 171–182. ACM (1997)
Hosangadi, A., Fallah, F., Kastner, R.: Factoring and eliminating common subexpressions in polynomial expressions. In: IEEE/ACM International Conference on Computer Aided Design, 2004. ICCAD-2004 (2004)
Jermaine, C., Arumugam, S., Pol, A., Dobra, A.: Scalable approximate query processing with the DBO engine. ACM Trans. Database Syst. (TODS) 33(4), 23 (2008)
Jetley, P. et al.: Massively parallel cosmological simulations with ChaNGa. In: Proceedings of IPDPS (2008)
Jordan, M.: An introduction to probabilistic graphical models (2003). http://www.cs.cmu.edu/~lebanon/pub/book/. Accessed 10 Nov 2018
Kandula, S., Shanbhag, A., Vitorovic, A., Olma, M., Grandl, R., Chaudhuri, S., Ding, B.: Quickr: lazily approximating complex adhoc queries in bigdata clusters. In: Proceedings of the 2016 International Conference on Management of Data, pp. 631–646. ACM (2016)
Kipf, A., Kipf, T., Radke, B., Leis, V., Boncz, P., Kemper, A.: Learned cardinalities: estimating correlated joins with deep learning (2018). arXiv preprint arXiv:1809.00677
Li, C., et al.: Optimizing linear counting queries under differential privacy. In: Proceedings of PODS, pp. 123–134 (2010)
Li, K., Li, G.: Approximate query processing: what is new and where to go? Data Sci. Eng. 3(4), 379–397 (2018)
Li, K., Zhang, Y., Li, G., Tao, W., Yan, Y.: Bounded approximate query processing. IEEE Trans. Knowl. Data Eng. (2018). https://doi.org/10.1109/TKDE.2018.2877362
Mäkinen, E., Siirtola, H.: Reordering the reorderable matrix as an algorithmic problem. In: International Conference on Theory and Application of Diagrams, pp. 453–468. Springer (2000)
Markl, V., et al.: Consistently estimating the selectivity of conjuncts of predicates. In: Proceedings of VLDB, pp. 373–384. VLDB Endowment (2005)
Mozafari, B., Niu, N.: A handbook for building an approximate query engine. IEEE Data Eng. Bull. 38(3), 3–29 (2015)
Murphy, K.: Undirected graphical models (2006). https://www.cs.ubc.ca/~murphyk/Teaching/CS340-Fall06/reading/ugm.pdf. Accessed 19 Nov 2018
Orr, L., Balazinska, M., Suciu, D.: Probabilistic database summarization for interactive data exploration. Proc. VLDB Endow. 10(10), 1154–1165 (2017)
Ortiz, J., Balazinska, M., Gehrke, J., Keerthi, S.S.: Learning state representations for query optimization with deep reinforcement learning (2018). arXiv preprint arXiv:1803.08604
Park, Y., Mozafari, B., Sorenson, J., Wang, J.: Verdictdb: universalizing approximate query processing. In: Proceedings of the 2018 International Conference on Management of Data, pp. 1461–1476. ACM (2018)
Peng, J., Zhang, D., Wang, J., Pei, J.: Aqp++: connecting approximate query processing with aggregate precomputation for interactive analytics. In: Proceedings of the 2018 International Conference on Management of Data, pp. 1477–1492. ACM (2018)
Ré, C., Suciu, D.: Understanding cardinality estimation using entropy maximization. ACM TODS 37(1), 6 (2012)
Suciu, D., Olteanu, D., Ré, C., Koch, C.: Probabilistic databases. Synth. Lect. Data Manag. 3(2), 1–180 (2011)
Teh, Y.W., Welling, M.: On improving the efficiency of the iterative proportional fitting procedure. In: AIStats (2003)
Thirumuruganathan, S., Hasan, S., Koudas, N., Das, G.: Approximate query processing using deep generative models (2019). arXiv preprint arXiv:1903.10000
Tzoumas, K., Deshpande, A., Jensen, C.S.: Efficiently adapting graphical models for selectivity estimation. VLDB J. 22(1), 3–27 (2013)
Wainwright, M.J., Jordan, M.I.: Graphical models, exponential families, and variational inference. Found. Trends Mach. Learn. 1(1–2), 1–305 (2008)
Wu, M., Jermaine, C.: A Bayesian method for guessing the extreme values in a data set? In: Proceedings of the 33rd International Conference on Very Large Data Bases, pp. 471–482. VLDB Endowment (2007)
Yang, E., Ravikumar, P., Allen, G.I., Liu, Z.: Graphical models via univariate exponential family distributions. J. Mach. Learn. Res. 16(1), 3813–3847 (2015)
Acknowledgements
This work is supported by NSF 1614738 and NSF 1535565. Laurel Orr is supported by the NSF Graduate Research Fellowship.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This is an extended version of the VLDB 2017 paper “Probabilistic Database Summarization for Interactive Data Exploration” [38].
Cite this article
Orr, L., Balazinska, M. & Suciu, D. EntropyDB: a probabilistic approach to approximate query processing. The VLDB Journal 29, 539–567 (2020). https://doi.org/10.1007/s00778-019-00582-9
DOI: https://doi.org/10.1007/s00778-019-00582-9