Random Partition Models for Microclustering Tasks
Journal of the American Statistical Association (IF 3.0), Pub Date: 2020-12-08, DOI: 10.1080/01621459.2020.1841647
Brenda Betancourt, Giacomo Zanella, Rebecca C. Steorts

Abstract

Traditional Bayesian random partition models assume that the size of each cluster grows linearly with the number of data points. While this is appealing for some applications, this assumption is not appropriate for other tasks such as entity resolution (ER), modeling of sparse networks, and DNA sequencing tasks. Such applications require models that yield clusters whose sizes grow sublinearly with the total number of data points—the microclustering property. Motivated by these issues, we propose a general class of random partition models that satisfy the microclustering property with well-characterized theoretical properties. Our proposed models overcome major limitations in the existing literature on microclustering models, namely a lack of interpretability, identifiability, and full characterization of model asymptotic properties. Crucially, we drop the classical assumption of having an exchangeable sequence of data points, and instead assume an exchangeable sequence of clusters. In addition, our framework provides flexibility in terms of the prior distribution of cluster sizes, computational tractability, and applicability to a large number of microclustering tasks. We establish theoretical properties of the resulting class of priors, where we characterize the asymptotic behavior of the number of clusters and of the proportion of clusters of a given size. Our framework allows a simple and efficient Markov chain Monte Carlo algorithm to perform statistical inference. We illustrate our proposed methodology on the microclustering task of ER, where we provide a simulation study and real experiments on survey panel data.
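
To make the contrast in the abstract concrete, the following minimal Python sketch compares two ways of generating a random partition of n points. It is an illustration under stated assumptions, not the authors' model or code. The first function simulates a Chinese restaurant process, a standard exchangeable partition prior whose largest cluster typically holds a constant fraction of the data. The second draws cluster sizes independently from a fixed distribution and keeps the draw only if the sizes sum exactly to n, which loosely mirrors the idea of an exchangeable sequence of clusters. The function names, the uniform size distribution on {1, ..., max_size}, and the rejection step are all assumptions made for this example. The microclustering property is commonly formalized as the largest cluster size divided by n tending to zero, so the sketch reports that ratio.

```python
import random


def crp_cluster_sizes(n, alpha=1.0, rng=None):
    """Cluster sizes from a Chinese restaurant process with concentration alpha."""
    rng = rng or random.Random()
    sizes = []
    for i in range(n):
        # Open a new cluster with probability alpha / (i + alpha); otherwise
        # join an existing cluster with probability proportional to its size.
        if rng.random() < alpha / (i + alpha):
            sizes.append(1)
        else:
            k = rng.choices(range(len(sizes)), weights=sizes)[0]
            sizes[k] += 1
    return sizes


def iid_size_partition(n, max_size=3, rng=None):
    """Cluster sizes drawn i.i.d. uniformly on {1, ..., max_size},
    conditioned (by rejection) on summing exactly to n.

    Hypothetical illustration: the size distribution is an assumption,
    not one taken from the paper.
    """
    rng = rng or random.Random()
    while True:
        sizes, total = [], 0
        while total < n:
            s = rng.randint(1, max_size)
            sizes.append(s)
            total += s
        if total == n:  # accept only if the sizes partition all n points
            return sizes


# Compare the fraction of data held by the largest cluster as n grows.
demo_rng = random.Random(0)
for n in (100, 1000, 10000):
    crp = crp_cluster_sizes(n, alpha=1.0, rng=demo_rng)
    iid = iid_size_partition(n, max_size=3, rng=demo_rng)
    print(n, "CRP max/n:", max(crp) / n, "i.i.d.-size max/n:", max(iid) / n)
```

In the second construction every cluster has at most max_size members, so the ratio is bounded by max_size / n and shrinks as n grows, whereas the Chinese restaurant process ratio does not. Entity resolution fits the second regime, since each real-world entity is expected to match only a handful of records regardless of database size.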

Updated: 2020-12-08
