Statistical properties of sketching algorithms
Biometrika (IF 2.4), Pub Date: 2020-07-30, DOI: 10.1093/biomet/asaa062
D C Ahfock, W J Astle, S Richardson

Sketching is a probabilistic data compression technique that has been largely developed in the computer science community. Numerical operations on big datasets can be intolerably slow; sketching algorithms address this issue by generating a smaller surrogate dataset. Typically, inference proceeds on the compressed dataset. Sketching algorithms generally use random projections to compress the original dataset, and this stochastic generation process makes them amenable to statistical analysis. We argue that the sketched data can be modelled as a random sample, thus placing this family of data compression methods firmly within an inferential framework. In particular, we focus on the Gaussian, Hadamard and Clarkson-Woodruff sketches, and their use in single-pass sketching algorithms for linear regression with huge $n$. We explore the statistical properties of sketched regression algorithms and derive new distributional results for a large class of sketched estimators. A key result is a conditional central limit theorem for data-oblivious sketches. An important finding is that the best choice of sketching algorithm in terms of mean squared error is related to the signal-to-noise ratio in the source dataset. Finally, we demonstrate the theory and the limits of its applicability on two real datasets.
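To make the idea of single-pass sketched regression concrete, here is a minimal illustration of a Clarkson-Woodruff style sketch followed by ordinary least squares on the compressed rows. It is a sketch under stated assumptions, not the authors' implementation: the function names (`clarkson_woodruff_sketch`, `least_squares`), the simulated data, and the choices n = 20000 and k = 500 are all hypothetical and chosen only for illustration.

```python
import random


def clarkson_woodruff_sketch(rows, k, seed=0):
    """Compress n rows into k rows in one pass.

    Each input row is sent to one random bucket with a random sign and
    added to that bucket, which is the action of a sparse random
    projection with a single +/-1 entry per column.
    """
    rng = random.Random(seed)
    d = len(rows[0])
    sketch = [[0.0] * d for _ in range(k)]
    for row in rows:
        bucket = rng.randrange(k)        # which sketched row receives this row
        sign = rng.choice((-1.0, 1.0))   # random sign flip
        target = sketch[bucket]
        for j, value in enumerate(row):
            target[j] += sign * value
    return sketch


def least_squares(rows):
    """Least squares fit where each row is (x_1, ..., x_p, y).

    Solves the normal equations by Gauss-Jordan elimination with partial
    pivoting; adequate for the small p used in this illustration.
    """
    p = len(rows[0]) - 1
    # Augmented system [X'X | X'y].
    a = [
        [sum(r[i] * r[j] for r in rows) for j in range(p)]
        + [sum(r[i] * r[p] for r in rows)]
        for i in range(p)
    ]
    for col in range(p):
        pivot = max(range(col, p), key=lambda idx: abs(a[idx][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for idx in range(p):
            if idx != col:
                factor = a[idx][col] / a[col][col]
                a[idx] = [x - factor * y for x, y in zip(a[idx], a[col])]
    return [a[i][p] / a[i][i] for i in range(p)]


# Simulated data (assumption): y = 1 + 2 * x + Gaussian noise.
data_rng = random.Random(1)
n, k = 20000, 500
data = []
for _ in range(n):
    x = data_rng.gauss(0.0, 1.0)
    y = 1.0 + 2.0 * x + data_rng.gauss(0.0, 1.0)
    data.append((1.0, x, y))             # intercept column, predictor, response

# The same random projection is applied jointly to the design matrix and
# the response, because every full row (x, y) is sketched together.
sketch = clarkson_woodruff_sketch(data, k)

beta_full = least_squares(data)      # fit on all n rows
beta_sketch = least_squares(sketch)  # fit on the k sketched rows
print("full data:", beta_full)
print("sketched: ", beta_sketch)
```

The sketched estimate varies around the full-data estimate from one random seed to the next. That randomness, induced by the projection rather than by the data, is what the abstract proposes to treat within an inferential framework.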

Updated: 2020-07-30