Fast multivariate empirical cumulative distribution function with connection to kernel density estimation
Computational Statistics & Data Analysis (IF 1.8), Pub Date: 2021-05-13, DOI: 10.1016/j.csda.2021.107267
Nicolas Langrené, Xavier Warin

The problem of efficiently computing empirical cumulative distribution functions (ECDFs) on large multivariate datasets is revisited. Computing an ECDF at one evaluation point requires $O(N)$ operations on a dataset composed of $N$ data points. A direct evaluation of ECDFs at $N$ evaluation points therefore requires $O(N^2)$ operations, which is prohibitive for large-scale problems. Two fast and exact methods are proposed and compared. The first is based on fast summation in lexicographical order, has $O(N \log N)$ complexity, and requires the evaluation points to lie on a regular grid. The second is based on the divide-and-conquer principle, has $O(N \log(N)^{(d-1) \vee 1})$ complexity, and requires the evaluation points to coincide with the input points. Both algorithms are described in detail for the general $d$-dimensional case, and numerical experiments validate their speed and accuracy. In addition, a direct connection between cumulative distribution functions and kernel density estimation (KDE) is established for a large class of kernels. This connection paves the way for fast exact algorithms for multivariate kernel density estimation and kernel regression. Numerical tests with the Laplacian kernel validate the speed and accuracy of the proposed algorithms. A broad range of large-scale multivariate problems in density estimation, cumulative distribution estimation, survival function estimation, and regression can benefit from the proposed numerical methods.
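
To make the complexity gap concrete, the following is a minimal sketch, not the authors' algorithms. It contrasts the direct ECDF evaluation described above with the standard sort-based shortcut that is available in the univariate case. The function names `ecdf_naive` and `ecdf_1d_sorted` are hypothetical, chosen only for illustration, and NumPy is assumed to be available.

```python
import numpy as np

def ecdf_naive(data, points):
    """Direct multivariate ECDF (illustrative helper, not from the paper).

    For each evaluation point z, returns the fraction of data rows x with
    x[k] <= z[k] in every coordinate k. Cost is O(N * M * d) for N data
    rows and M evaluation points, i.e. quadratic when M == N.
    """
    data = np.asarray(data, dtype=float)
    points = np.asarray(points, dtype=float)
    # below[j, i] is True when data row i is componentwise <= point j
    below = (data[None, :, :] <= points[:, None, :]).all(axis=2)
    return below.mean(axis=1)

def ecdf_1d_sorted(samples, points):
    """Univariate ECDF via one sort plus binary searches: O((N + M) log N)."""
    ordered = np.sort(np.asarray(samples, dtype=float))
    # side="right" counts samples <= point, matching the naive definition
    counts = np.searchsorted(ordered, points, side="right")
    return counts / ordered.size

# Evaluate at the input points themselves, as in the second method's setting
rng = np.random.default_rng(0)
x = rng.normal(size=(500, 1))
naive = ecdf_naive(x, x)
fast = ecdf_1d_sorted(x[:, 0], x[:, 0])
print(np.allclose(naive, fast))
```

The sort-based shortcut only works in one dimension, because componentwise ordering is not a total order for $d > 1$. That limitation is what motivates the grid-based and divide-and-conquer methods summarized in the abstract.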




Updated: 2021-05-26