research-article

Large-scale Data Exploration Using Explanatory Regression Functions

Published: 28 September 2020

Abstract

Analysts exploring multivariate data spaces typically issue queries with selection operators, i.e., range or equality predicates, which define data subspaces of potential interest. They then apply aggregation functions, whose results determine how interesting a subspace is for further exploration and deeper analysis. However, Aggregate Query (AQ) results are scalars: they convey limited information about the queried subspaces and offer little explanation to support exploratory analysis. Analysts have no way of identifying how these results are derived or how they change w.r.t. query (input) parameter values. We address this shortcoming with a novel explanation mechanism based on machine learning that helps analysts explore and understand data subspaces. We explain AQ results using functions obtained from a three-fold joint optimization problem; these functions take the form of explainable piecewise-linear regression functions. A key feature of the proposed solution is that the explanation functions are estimated from previously executed queries. These queries provide a coarse-grained overview of the underlying aggregate function (the one generating the AQ results) to be learned. Explanations for future, previously unseen AQs can be computed without accessing the underlying data and can be used to further explore the queried data subspaces without issuing more queries to the back-end analytics engine. We evaluate the explanation accuracy and efficiency through theoretically grounded metrics over real-world and synthetic datasets and query workloads.
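To make the idea of a piecewise-linear explanation function concrete, the following is a minimal sketch under simplifying assumptions and is not the method of the article. It assumes each past query is summarised by a single numeric parameter `theta` (for example, the radius of a range predicate) paired with its aggregate result, and it replaces the three-fold joint optimization with a much simpler stand-in: the sorted past queries are split into equal-sized segments and an ordinary least-squares line is fitted to each. All function names (`fit_line`, `fit_explanation`, `explain`) are hypothetical.

```python
from bisect import bisect_right
from statistics import mean


def fit_line(points):
    """Ordinary least squares for y = intercept + slope * t over (t, y) pairs."""
    ts = [t for t, _ in points]
    ys = [y for _, y in points]
    t_bar, y_bar = mean(ts), mean(ys)
    var = sum((t - t_bar) ** 2 for t in ts)
    if var == 0:
        # All parameter values identical: fall back to a constant function.
        return y_bar, 0.0
    slope = sum((t - t_bar) * (y - y_bar) for t, y in points) / var
    return y_bar - slope * t_bar, slope


def fit_explanation(past_queries, n_pieces):
    """Fit one line per contiguous segment of the sorted past queries.

    past_queries: list of (theta, aggregate_result) pairs from executed queries.
    Returns (breakpoints, lines) where lines[i] = (intercept, slope).
    Assumption: equal-sized segments stand in for a learned partitioning.
    """
    pts = sorted(past_queries)
    size = len(pts) // n_pieces
    if size < 2:
        raise ValueError("need at least two past queries per piece")
    breakpoints, lines = [], []
    for i in range(n_pieces):
        lo = i * size
        hi = len(pts) if i == n_pieces - 1 else (i + 1) * size
        segment = pts[lo:hi]
        lines.append(fit_line(segment))
        if i > 0:
            # A piece starts at the smallest theta of its segment.
            breakpoints.append(segment[0][0])
    return breakpoints, lines


def explain(model, theta):
    """Estimate the aggregate result and its local rate of change at theta.

    Uses only the fitted lines; the underlying data is never accessed.
    """
    breakpoints, lines = model
    intercept, slope = lines[bisect_right(breakpoints, theta)]
    return intercept + slope * theta, slope


# Synthetic past queries: the aggregate grows quickly, then flattens out.
past = [(t, 2.0 * t) for t in range(5)]
past += [(t, 10 + 0.5 * (t - 5)) for t in range(5, 10)]
model = fit_explanation(past, n_pieces=2)
print(explain(model, 2.5))  # estimate and slope for an unseen parameter value
print(explain(model, 7.5))
```

The slope returned alongside each estimate is what carries the explanation in this sketch: it indicates how strongly the aggregate result responds to a change in the query parameter within that region. Note that the fitted pieces are not forced to join at the breakpoints.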

        • Published in

          ACM Transactions on Knowledge Discovery from Data  Volume 14, Issue 6
          December 2020
          376 pages
          ISSN:1556-4681
          EISSN:1556-472X
          DOI:10.1145/3427188
          Copyright © 2020 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 28 September 2020
          • Accepted: 1 July 2020
          • Revised: 1 April 2020
          • Received: 1 November 2019

          Qualifiers

          • research-article
          • Research
          • Refereed
