Abstract
Analysts exploring multivariate data spaces typically issue queries with selection operators, i.e., range or equality predicates, that define data subspaces of potential interest. They then apply aggregation functions, whose results determine how interesting a subspace is for further exploration and deeper analysis. However, Aggregate Query (AQ) results are scalars: they convey limited information about the queried subspaces and offer little explanation to support exploratory analysis. Analysts have no way of identifying how these results are derived or how they change with respect to the query (input) parameter values. We address this shortcoming with a novel explanation mechanism, based on machine learning, that helps analysts explore and understand data subspaces. We explain AQ results using functions obtained by solving a three-fold joint optimization problem; these functions take the form of explainable piecewise-linear regression functions. A key feature of the proposed solution is that the explanation functions are estimated from previously executed queries, which provide a coarse-grained overview of the underlying aggregate function (the one generating the AQ results) to be learned. Explanations for future, previously unseen AQs can be computed without accessing the underlying data and can be used to further explore the queried data subspaces without issuing more queries to the backend analytics engine. We evaluate explanation accuracy and efficiency using theoretically grounded metrics over real-world and synthetic datasets and query workloads.
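The idea of learning explanation functions from past queries can be illustrated with a minimal sketch. It is not the paper's three-fold optimization: it assumes a hypothetical 1-D dataset, range COUNT queries parameterized by a center and a radius, equal-width bins over query centers in place of a learned partitioning, and a single least-squares line per bin in place of a full piecewise-linear fit. All names (`run_count_query`, `explain`, `n_groups`) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D dataset; it is only used to simulate past query results.
data = rng.normal(loc=0.0, scale=1.0, size=10_000)

def run_count_query(center, radius):
    """Exact COUNT over the range [center - radius, center + radius]."""
    return int(np.sum(np.abs(data - center) <= radius))

# Simulated log of past queries: (center, radius) -> aggregate result.
centers = rng.uniform(-2.0, 2.0, size=500)
radii = rng.uniform(0.05, 0.5, size=500)
results = np.array(
    [run_count_query(c, r) for c, r in zip(centers, radii)], dtype=float
)

# Step 1: group past queries by center (equal-width bins stand in for clustering).
n_groups = 8
edges = np.linspace(-2.0, 2.0, n_groups + 1)
group_of = np.clip(np.digitize(centers, edges) - 1, 0, n_groups - 1)

# Step 2: fit one local line, result ~ intercept + slope * radius, per group.
models = {}
for g in range(n_groups):
    mask = group_of == g
    design = np.column_stack([np.ones(mask.sum()), radii[mask]])
    coef, *_ = np.linalg.lstsq(design, results[mask], rcond=None)
    models[g] = coef  # (intercept, slope)

def explain(center):
    """Return a function radius -> estimated COUNT for an unseen query center."""
    g = int(np.clip(np.digitize(center, edges) - 1, 0, n_groups - 1))
    intercept, slope = models[g]
    return lambda radius: intercept + slope * radius

# The explanation is evaluated without touching `data`.
f = explain(0.1)
estimate = f(0.3)
# The exact query is run here only to check the estimate.
exact = run_count_query(0.1, 0.3)
print(f"estimated COUNT: {estimate:.0f}, exact COUNT: {exact}")
```

Once fitted, `f` can be evaluated at any radius to see how the aggregate would change as the queried subspace grows or shrinks, without issuing further queries. The slope of each local line gives a rough reading of how sensitive the result is to the query parameter in that region.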
Index Terms
- Large-scale Data Exploration Using Explanatory Regression Functions