Computational Statistics & Data Analysis (IF 1.8). Pub Date: 2021-05-13. DOI: 10.1016/j.csda.2021.107267. Authors: Nicolas Langrené, Xavier Warin
The problem of efficiently computing empirical cumulative distribution functions (ECDFs) on large multivariate datasets is revisited. Computing an ECDF at one evaluation point requires O(N) operations on a dataset composed of N data points. Therefore, a direct evaluation of an ECDF at N evaluation points requires a quadratic number of operations, O(N^2), which is prohibitive for large-scale problems. Two fast and exact methods are proposed and compared. The first is based on fast summation in lexicographical order; it has O(N log N) complexity and requires the evaluation points to lie on a regular grid. The second is based on the divide-and-conquer principle; it has O(N log(N)^((d-1)∨1)) complexity and requires the evaluation points to coincide with the input points. The two fast algorithms are described in detail in the general d-dimensional case, and numerical experiments validate their speed and accuracy. In addition, a direct connection between cumulative distribution functions and kernel density estimation (KDE) is established for a large class of kernels. This connection paves the way for fast exact algorithms for multivariate kernel density estimation and kernel regression. Numerical tests with the Laplacian kernel validate the speed and accuracy of the proposed algorithms. A broad range of large-scale multivariate density estimation, cumulative distribution estimation, survival function estimation and regression problems can benefit from the proposed numerical methods.
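To make the cost contrast concrete, here is a minimal Python sketch, not the authors' implementation. It compares a direct ECDF evaluation with a grid-based one that bins the data once and then takes cumulative sums along each axis. The function names `ecdf_naive` and `ecdf_on_grid` and the `edges` argument are hypothetical, and the binning-plus-cumulative-sum approach is a simple stand-in for the lexicographic fast summation described in the paper. It shares only the requirement that evaluation points lie on a grid.

```python
import numpy as np

def ecdf_naive(data, points):
    """Direct ECDF: for each evaluation point z, the fraction of data points x
    with x <= z in every coordinate. Cost is O(N * M * d)."""
    # data: (N, d), points: (M, d) -> boolean array of shape (M, N)
    below = (data[None, :, :] <= points[:, None, :]).all(axis=2)
    return below.mean(axis=1)

def ecdf_on_grid(data, edges):
    """ECDF evaluated at every node of a rectilinear grid (assumed helper).
    edges is a list of d sorted 1-D arrays holding the grid coordinates."""
    n, d = data.shape
    shape = tuple(len(e) for e in edges)
    # For each coordinate, index of the smallest grid value >= the data value.
    idx = [np.searchsorted(edges[k], data[:, k], side="left") for k in range(d)]
    # Points beyond the last grid value in any coordinate never count.
    keep = np.all([idx[k] < shape[k] for k in range(d)], axis=0)
    counts = np.zeros(shape)
    np.add.at(counts, tuple(i[keep] for i in idx), 1.0)
    # Cumulative sums along each axis turn cell counts into "<=" counts.
    for axis in range(d):
        counts = np.cumsum(counts, axis=axis)
    return counts / n
```

Both functions return the same values at the grid nodes. The second touches each data point once and each grid node d times, instead of comparing every data point against every node.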
Title: Fast multivariate empirical cumulative distribution function with connection to kernel density estimation