An Optimal Transport Approach for Selecting a Representative Subsample with Application in Efficient Kernel Density Estimation, Journal of Computational and Graphical Statistics



An Optimal Transport Approach for Selecting a Representative Subsample with Application in Efficient Kernel Density Estimation
Journal of Computational and Graphical Statistics (IF 1.4), Pub Date: 2022-07-05, DOI: 10.1080/10618600.2022.2084404
Jingyi Zhang, Cheng Meng, Jun Yu, Mengrui Zhang, Wenxuan Zhong, Ping Ma

Abstract

Subsampling methods aim to select a subsample as a surrogate for the observed sample. Such methods have been used pervasively in large-scale data analytics, active learning, and privacy-preserving analysis in recent decades. In this article, we study model-free rather than model-based subsampling methods, which aim to identify a subsample that is not confined by model assumptions. Existing model-free subsampling methods are usually built upon clustering techniques or kernel tricks. Most of these methods suffer from either a large computational burden or a theoretical weakness: the empirical distribution of the selected subsample may not converge to the population distribution. Such computational and theoretical limitations hinder the broad applicability of model-free subsampling methods in practice. We propose a novel model-free subsampling method based on optimal transport techniques. Moreover, we develop an efficient subsampling algorithm that is adaptive to the unknown probability density function. Theoretically, we show that the selected subsample can be used for efficient density estimation by deriving the convergence rate of the proposed subsample kernel density estimator. We also provide the optimal bandwidth for the proposed estimator. Numerical studies on synthetic and real-world datasets demonstrate the superior performance of the proposed method.
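To make the idea of a subsample kernel density estimator concrete, here is a minimal Python sketch. It is not the method proposed in the article: the subsample is drawn uniformly at random as a simple baseline rather than by optimal transport, and the bandwidth comes from the standard normal-reference rule of thumb rather than from the optimal bandwidth the authors derive. The function names, the synthetic data, and the sizes 10,000 and 200 are all assumptions chosen for illustration.

```python
import numpy as np

def gaussian_kde(points, grid, bandwidth):
    """Evaluate a 1-D Gaussian kernel density estimate on `grid`."""
    z = (grid[:, None] - points[None, :]) / bandwidth
    norm = len(points) * bandwidth * np.sqrt(2 * np.pi)
    return np.exp(-0.5 * z**2).sum(axis=1) / norm

def rule_of_thumb_bandwidth(points):
    """Normal-reference bandwidth (assumption: NOT the paper's optimal bandwidth)."""
    return 1.06 * points.std(ddof=1) * len(points) ** (-1 / 5)

rng = np.random.default_rng(0)
sample = rng.normal(size=10_000)  # synthetic "observed sample"
# Uniform random subsample: a baseline stand-in, not the optimal transport selection.
subsample = rng.choice(sample, size=200, replace=False)

grid = np.linspace(-4, 4, 201)
full_kde = gaussian_kde(sample, grid, rule_of_thumb_bandwidth(sample))
sub_kde = gaussian_kde(subsample, grid, rule_of_thumb_bandwidth(subsample))

# Integrated squared difference between the two estimates (Riemann sum on the grid).
step = grid[1] - grid[0]
ise = np.sum((full_kde - sub_kde) ** 2) * step
print(f"ISE between full-sample and subsample KDE: {ise:.5f}")
```

The integrated squared difference printed at the end is one simple way to judge how well a subsample estimate tracks the full-sample estimate. Any other selection rule can be compared in the same way by changing only the line that builds `subsample`.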


Updated: 2022-07-05
