Efficient Optimization of Partition Scan Statistics via the Consecutive Partitions Property, Journal of Computational and Graphical Statistics

Journal of Computational and Graphical Statistics (IF 1.4), Pub Date: 2022-06-29, DOI: 10.1080/10618600.2022.2077351
Charles A. Pehlivanian, Daniel B. Neill

Abstract: We generalize the spatial and subset scan statistics from the single to the multiple subset case. The two main approaches to defining the log-likelihood ratio statistic in the single subset case (the population-based and expectation-based scan statistics) are considered, leading to risk partitioning and multiple cluster detection scan statistics, respectively. We show that, for distributions in a separable exponential family, the risk partitioning scan statistic can be expressed as a scaled f-divergence of the normalized count and baseline vectors, and the multiple cluster detection scan statistic as a sum of scaled Bregman divergences. In either case, however, maximization of the scan statistic by exhaustive search over all partitionings of the data requires exponential time. To make this optimization computationally feasible, we prove sufficient conditions under which the optimal partitioning is guaranteed to be consecutive. This Consecutive Partitions Property generalizes the linear-time subset scanning property from two partitions (the detected subset and the remaining data elements) to the multiple partition case. While the number of consecutive partitionings of $n$ elements into $t$ partitions scales as $O(n^{t-1})$, making it computationally expensive for large $t$, we present a dynamic programming approach which identifies the optimal consecutive partitioning in $O(n^{2} t)$ time, thus allowing for the exact and efficient solution of large-scale risk partitioning and multiple cluster detection problems. Finally, we demonstrate the detection performance and practical utility of partition scan statistics using simulated and real-world data. Supplementary materials for this article are available online.
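The dynamic programming idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the ordering of elements by count/baseline ratio, and the expectation-based Poisson block score used here are all assumptions chosen for illustration, and baselines are assumed to be strictly positive. The sketch only shows how an additive score over $t$ consecutive blocks of $n$ sorted elements can be maximized with $O(n^{2} t)$ block-score evaluations, using prefix sums so that each evaluation takes constant time.

```python
import math
from itertools import accumulate


def poisson_score(c, b):
    """Illustrative block score (an assumption, not taken from the paper):
    expectation-based Poisson log-likelihood ratio, zero unless c > b."""
    if b <= 0 or c <= b:
        return 0.0
    return c * math.log(c / b) + b - c


def best_consecutive_partition(counts, baselines, t, score=poisson_score):
    """Split the elements, sorted by count/baseline ratio (assumed priority),
    into t consecutive non-empty blocks maximizing the summed block score.
    Returns (best_total, blocks), where blocks hold original element indices."""
    n = len(counts)
    if not 1 <= t <= n:
        raise ValueError("need 1 <= t <= number of elements")
    order = sorted(range(n), key=lambda i: counts[i] / baselines[i], reverse=True)
    # Prefix sums over the sorted order: block (i, j] has count C[j] - C[i].
    C = [0.0] + list(accumulate(counts[i] for i in order))
    B = [0.0] + list(accumulate(baselines[i] for i in order))

    neg_inf = float("-inf")
    # best[k][j]: best total score for the first j sorted elements in k blocks.
    best = [[neg_inf] * (n + 1) for _ in range(t + 1)]
    cut = [[0] * (n + 1) for _ in range(t + 1)]
    best[0][0] = 0.0
    for k in range(1, t + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):  # last block covers sorted positions i..j-1
                if best[k - 1][i] == neg_inf:
                    continue
                val = best[k - 1][i] + score(C[j] - C[i], B[j] - B[i])
                if val > best[k][j]:
                    best[k][j], cut[k][j] = val, i

    # Walk the stored cut points backwards to recover the blocks.
    blocks, j = [], n
    for k in range(t, 0, -1):
        i = cut[k][j]
        blocks.append([order[p] for p in range(i, j)])
        j = i
    blocks.reverse()
    return best[t][n], blocks
```

For comparison, a brute-force search over consecutive partitionings would enumerate every choice of $t-1$ cut points among the $n-1$ gaps, that is $\binom{n-1}{t-1} = O(n^{t-1})$ candidates, which is the cost the abstract describes as expensive for large $t$. The three nested loops above replace that enumeration. Whether the best consecutive partitioning is also the best over all partitionings depends on the sufficient conditions proved in the paper, which this sketch does not check.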

Updated: 2022-06-29
