research-article

Large-scale Data Exploration Using Explanatory Regression Functions

Authors:
Fotis Savva, University of Glasgow, United Kingdom
Christos Anagnostopoulos, University of Glasgow, United Kingdom
Peter Triantafillou, University of Warwick, United Kingdom
Kostas Kolomvatsos, University of Thessaly, Greece

ACM Transactions on Knowledge Discovery from Data, Volume 14, Issue 6, Article No. 76, pp. 1–33. https://doi.org/10.1145/3410448

Published: 28 September 2020

Abstract

Analysts wishing to explore multivariate data spaces typically issue queries involving selection operators, i.e., range or equality predicates, which define data subspaces of potential interest. They then apply aggregation functions, whose results determine a subspace’s interestingness for further exploration and deeper analysis. However, Aggregate Query (AQ) results are scalars and convey limited information about the queried subspaces, offering little explainability for exploratory analysis. Analysts have no way of identifying how these results are derived or how they change w.r.t. query (input) parameter values. We address this shortcoming with a novel explanation mechanism based on machine learning that helps analysts explore and understand data subspaces. We explain AQ results using functions that are obtained by solving a three-fold joint optimization problem and that take the form of explainable piecewise-linear regression functions. A key feature of the proposed solution is that the explanation functions are estimated from previously executed queries. These queries provide a coarse-grained overview of the underlying aggregate function (generating the AQ results) to be learned. Explanations for future, previously unseen AQs can be computed without accessing the underlying data and can be used to further explore the queried data subspaces without issuing more queries to the backend analytics engine. We evaluate the explanation accuracy and efficiency through theoretically grounded metrics over real-world and synthetic datasets and query workloads.
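
To make the idea of a piecewise-linear explanation function learned from past queries more concrete, the sketch below may help. It is not the three-fold joint optimization described in the abstract: as a simpler stand-in, it partitions past queries with a plain k-means loop and fits one least-squares linear model per region. The query encoding (range centre and radius), the function names, and all parameter values are assumptions made purely for illustration.

```python
import numpy as np


def assign(queries, prototypes):
    """Index of the nearest region prototype for every query vector."""
    dists = np.linalg.norm(queries[:, None, :] - prototypes[None, :, :], axis=2)
    return dists.argmin(axis=1)


def fit_local_linear_models(queries, results, n_regions=4, n_iters=20, seed=0):
    """Partition past queries into regions and fit one linear model per region.

    queries: (n, d) array of query parameters (hypothetical encoding,
             e.g. [centre, radius] of a range predicate).
    results: (n,) array of the aggregate results those queries returned.
    """
    rng = np.random.default_rng(seed)
    # Initialise the region prototypes from randomly chosen past queries.
    prototypes = queries[rng.choice(len(queries), n_regions, replace=False)].copy()
    for _ in range(n_iters):
        labels = assign(queries, prototypes)
        for k in range(n_regions):
            members = queries[labels == k]
            if len(members) > 0:  # an empty region keeps its old prototype
                prototypes[k] = members.mean(axis=0)
    labels = assign(queries, prototypes)

    # Design matrix with a trailing column of ones for the intercept.
    design = np.hstack([queries, np.ones((len(queries), 1))])
    # Global model, used as a fallback for regions with too few queries.
    global_coef, *_ = np.linalg.lstsq(design, results, rcond=None)
    coefs = []
    for k in range(n_regions):
        mask = labels == k
        if mask.sum() > queries.shape[1]:
            coef, *_ = np.linalg.lstsq(design[mask], results[mask], rcond=None)
        else:
            coef = global_coef
        coefs.append(coef)
    return prototypes, np.array(coefs)


def explain(query, prototypes, coefs):
    """Return (slopes, intercept) of the linear piece covering this query."""
    query = np.asarray(query, dtype=float)
    k = assign(query[None, :], prototypes)[0]
    return coefs[k][:-1], coefs[k][-1]


def predict(query, prototypes, coefs):
    """Estimate the aggregate result without touching the underlying data."""
    slopes, intercept = explain(query, prototypes, coefs)
    return float(slopes @ np.asarray(query, dtype=float) + intercept)


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    data = rng.normal(size=10_000)  # synthetic 1-D dataset (assumption)
    # Past range queries: column 0 is the centre, column 1 is the radius.
    past = np.column_stack([rng.uniform(-2, 2, 500), rng.uniform(0.1, 1.0, 500)])
    counts = np.array([np.sum(np.abs(data - c) <= r) for c, r in past], dtype=float)

    prototypes, coefs = fit_local_linear_models(past, counts)
    new_query = np.array([0.5, 0.4])
    slopes, intercept = explain(new_query, prototypes, coefs)
    print("local slopes:", slopes, "intercept:", intercept)
    print("estimated COUNT:", predict(new_query, prototypes, coefs))
```

In this sketch, the slopes returned by `explain` play the role of the explanation: they indicate how the estimated aggregate changes as each query parameter varies within the current region, and new queries are answered from the fitted models alone rather than from the data.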


Published in
ACM Transactions on Knowledge Discovery from Data Volume 14, Issue 6
December 2020
376 pages
ISSN:1556-4681
EISSN:1556-472X
DOI:10.1145/3427188
Editors:
Charu Aggarwal, IBM T. J. Watson Research, USA
Xindong Wu, Mininglamp Academy of Sciences, China
Copyright © 2020 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 28 September 2020
- Accepted: 1 July 2020
- Revised: 1 April 2020
- Received: 1 November 2019
Author Tags
Explainability
aggregate query explanation
data exploration
range query explanation
Qualifiers
- research-article
- Research
- Refereed
