An Approach to Incorporate Subsampling Into a Generic Bayesian Hierarchical Model, Journal of Computational and Graphical Statistics

An Approach to Incorporate Subsampling Into a Generic Bayesian Hierarchical Model
Journal of Computational and Graphical Statistics (IF 1.4), Pub Date: 2021-06-21, DOI: 10.1080/10618600.2021.1923518
Jonathan R. Bradley

Abstract

The goal of this article is to provide a way for Bayesian statisticians to incorporate subsampling directly into the Bayesian hierarchical model of their choosing without imposing additional restrictive model assumptions. We are motivated by the fact that the rise of “big data” has created difficulties for statisticians to directly apply their methods to big datasets. We introduce a “data subset model” to the popular “data model, process model, and parameter model” framework used to summarize Bayesian hierarchical models. The hyperparameters of the data subset model are specified constructively in that they are chosen such that the implied size of the subset satisfies predefined computational constraints. Thus, these hyperparameters effectively calibrate the statistical model to the computer itself to obtain predictions/estimations in a prespecified amount of time. Several properties of the data subset model are provided including: propriety, partial sufficiency, and semi-parametric properties. Simulated datasets will be used to assess the consequences of subsampling, and results will be presented across different computers to show the effect of the computer on the statistical analysis. Additionally, we provide a joint analysis of a high-dimensional dataset (roughly 10 gigabytes) consisting of 2018 5-year period estimates from the U.S. Census Bureau’s Public Use Micro-Sample (PUMS).
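
As a rough illustration of the calibration idea in the abstract (choose the subsampling hyperparameter so that the implied subset size fits a computational budget), the sketch below picks a Bernoulli inclusion probability from a time budget and then fits a simple conjugate normal model to the resulting subset. This is not the article's data subset model: the Bernoulli inclusion scheme, the per-row cost `seconds_per_row`, the normal-normal model, and every function name here are assumptions made only for illustration.

```python
import random


def inclusion_probability(n_total, seconds_per_row, time_budget_seconds):
    """Hypothetical calibration step: the largest inclusion probability
    whose expected subset size fits within the time budget."""
    if n_total <= 0 or seconds_per_row <= 0 or time_budget_seconds < 0:
        raise ValueError("n_total and seconds_per_row must be positive, budget non-negative")
    max_rows = time_budget_seconds / seconds_per_row  # rows affordable in the budget
    return min(1.0, max_rows / n_total)


def draw_subset(data, prob, rng):
    """Keep each observation independently with probability `prob`
    (an assumed Bernoulli inclusion scheme)."""
    return [y for y in data if rng.random() < prob]


def normal_posterior(subset, prior_mean=0.0, prior_var=100.0, noise_var=1.0):
    """Conjugate normal-normal update for an unknown mean with known noise
    variance, applied to the subset only. Returns (mean, variance)."""
    n = len(subset)
    precision = 1.0 / prior_var + n / noise_var
    mean = (prior_mean / prior_var + sum(subset) / noise_var) / precision
    return mean, 1.0 / precision


if __name__ == "__main__":
    rng = random.Random(0)
    # Simulated data with true mean 2.0 (illustrative only).
    data = [rng.gauss(2.0, 1.0) for _ in range(100_000)]
    # Assumed cost of 1e-4 seconds per row and a 1-second budget.
    p = inclusion_probability(len(data), seconds_per_row=1e-4, time_budget_seconds=1.0)
    subset = draw_subset(data, p, rng)
    mean, var = normal_posterior(subset)
    print(f"inclusion probability: {p:.3f}, subset size: {len(subset)}")
    print(f"posterior mean: {mean:.3f}, posterior variance: {var:.6f}")
```

In this toy setup a slower machine (a larger `seconds_per_row`) yields a smaller inclusion probability and hence a wider posterior, which loosely mirrors the abstract's point that the computer itself affects the statistical analysis.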

Updated: 2021-06-21
