Abstract
We empirically evaluate lightweight moment estimators for the single-pass quantile approximation problem, including maximum entropy methods and orthogonal series with Fourier, Cosine, Legendre, Chebyshev, and Hermite basis functions. We show how to apply stable summation formulas to offset numerical precision issues for higher-order moments, leading to reliable single-pass moment estimators up to order 15. Additionally, we provide an algorithm for GPU-accelerated quantile approximation based on parallel tree reduction. Experiments evaluate the accuracy and runtime of moment estimators against the state-of-the-art KLL quantile estimator on 14,072 real-world datasets drawn from the OpenML database. Our analysis highlights the effectiveness of variants of moment-based quantile approximation for highly space-efficient summaries: their average performance using as few as five sample moments can approach the performance of a KLL sketch containing 500 elements. Experiments also illustrate the difficulty of applying the method reliably and showcase which moment-based approximations can be expected to fail or perform poorly.
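To make the idea of stable, mergeable moment summaries concrete, the following is a minimal Python sketch, not the paper's implementation. It assumes the well-known pairwise update of Chan, Golub, and LeVeque for the count, mean, and sum of squared deviations (order 2 only, whereas the abstract reports estimators up to order 15), and it combines per-element summaries with a sequential pairwise tree reduction that mimics the shape of a parallel reduction on the CPU. The names `Moments`, `merge`, `tree_reduce`, and `summarize` are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Moments:
    """Mergeable summary: count, mean, and sum of squared deviations."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum over i of (x_i - mean)^2


def merge(a: Moments, b: Moments) -> Moments:
    """Combine two summaries with the pairwise update formula.

    Working with deviations from the mean avoids subtracting two large,
    nearly equal raw power sums.
    """
    if a.n == 0:
        return b
    if b.n == 0:
        return a
    n = a.n + b.n
    delta = b.mean - a.mean
    mean = a.mean + delta * b.n / n
    m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / n
    return Moments(n, mean, m2)


def tree_reduce(items) -> Moments:
    """Pairwise (tree-shaped) reduction, executed sequentially here."""
    level = list(items)
    if not level:
        return Moments()
    while len(level) > 1:
        # Merge neighbours; an unpaired last element is carried upward.
        nxt = [merge(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])
        level = nxt
    return level[0]


def summarize(data) -> Moments:
    """Build a summary of `data` from single-element summaries."""
    return tree_reduce(Moments(1, float(x), 0.0) for x in data)
```

Because `merge` is associative up to rounding, the same function could in principle serve as the combine step of a parallel reduction, with each worker summarizing its own chunk. The sample variance follows as `m2 / (n - 1)`; extending the sketch to higher orders would require the corresponding higher-order update formulas, which are not shown here.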
Index Terms
- An Empirical Study of Moment Estimators for Quantile Approximation