A Sequential Monte Carlo Method for Bayesian Analysis of Massive Datasets.
Data Mining and Knowledge Discovery (IF 4.8), Pub Date: 2003-07-01, DOI: 10.1023/a:1024084221803
Greg Ridgeway, David Madigan

Markov chain Monte Carlo (MCMC) techniques revolutionized statistical practice in the 1990s by providing an essential toolkit for making the rigor and flexibility of Bayesian analysis computationally practical. At the same time, the increasing prevalence of massive datasets and the expansion of the field of data mining have created the need for statistically sound methods that scale to these large problems. Except for the most trivial examples, current MCMC methods require a complete scan of the dataset for each iteration, eliminating their candidacy as feasible data mining techniques.

In this article we present a method for making Bayesian analysis of massive datasets computationally feasible. The algorithm simulates from a posterior distribution that conditions on a smaller, more manageable portion of the dataset. The remainder of the dataset may be incorporated by reweighting the initial draws using importance sampling. Computation of the importance weights requires a single scan of the remaining observations. While importance sampling increases efficiency in data access, it comes at the expense of estimation efficiency. A simple modification, based on the "rejuvenation" step used in particle filters for dynamic systems models, sidesteps the loss of efficiency with only a slight increase in the number of data accesses.

To show proof-of-concept, we demonstrate the method on two examples. The first is a mixture of transition models that has been used to model web traffic and robotics. For this example we show that estimation efficiency is not affected while offering a 99% reduction in data accesses. The second example applies the method to Bayesian logistic regression and yields a 98% reduction in data accesses.
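The reweighting idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the model (a normal mean with known standard deviation and a conjugate normal prior) is an assumption chosen so that the subset posterior can be sampled exactly, and the function names `sample_subset_posterior`, `importance_weights`, and `effective_sample_size` are hypothetical. Because the draws come from the posterior given the subset, the importance weight of each draw is proportional to the likelihood of the remaining observations, which can be accumulated in one pass over them.

```python
import math
import random


def sample_subset_posterior(subset, sigma, prior_mean, prior_sd, n_draws, rng):
    """Draw from p(theta | subset) for a normal mean with known sigma.

    Assumed toy model: a conjugate normal prior, so the posterior is normal.
    """
    precision = 1.0 / prior_sd ** 2 + len(subset) / sigma ** 2
    mean = (prior_mean / prior_sd ** 2 + sum(subset) / sigma ** 2) / precision
    sd = math.sqrt(1.0 / precision)
    return [rng.gauss(mean, sd) for _ in range(n_draws)]


def importance_weights(draws, remainder, sigma):
    """Single scan over `remainder`; returns normalized weights, one per draw."""
    log_w = [0.0] * len(draws)
    for x in remainder:  # each remaining observation is read exactly once
        for i, theta in enumerate(draws):
            # Log-likelihood of x given theta, up to a constant shared by all draws.
            log_w[i] -= 0.5 * ((x - theta) / sigma) ** 2
    shift = max(log_w)  # subtract the max before exponentiating to avoid underflow
    w = [math.exp(lw - shift) for lw in log_w]
    total = sum(w)
    return [wi / total for wi in w]


def effective_sample_size(weights):
    """Effective sample size of normalized weights: 1 / sum of squared weights."""
    return 1.0 / sum(w * w for w in weights)


if __name__ == "__main__":
    rng = random.Random(0)
    data = [rng.gauss(2.0, 1.0) for _ in range(1000)]
    subset, remainder = data[:200], data[200:]
    draws = sample_subset_posterior(subset, 1.0, 0.0, 10.0, 400, rng)
    weights = importance_weights(draws, remainder, 1.0)
    estimate = sum(w * t for w, t in zip(weights, draws))
    print("posterior mean estimate:", estimate)
    print("effective sample size:", effective_sample_size(weights))
```

An effective sample size well below the number of draws is one way to see the loss of estimation efficiency that the abstract attributes to importance sampling. The rejuvenation step that the article proposes to counter this loss is not shown in this sketch.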

Updated: 2019-11-01