Journal of Applied Statistics (IF 1.5), Pub Date: 2021-04-06, DOI: 10.1080/02664763.2021.1911967. Xie Xiaoyue 1,2, Jian Shi 1,2, Kai Song 3
ABSTRACT
When data are stored in a distributed manner, directly applying traditional hypothesis testing procedures is often infeasible because of communication costs and privacy concerns. This paper develops and investigates a distributed two-node Kolmogorov–Smirnov hypothesis testing scheme, implemented with a divide-and-conquer strategy. Building on this scheme, it also provides a distributed fraud detection method and a distribution-based classification method for multi-node systems. Distributed fraud detection identifies which node in a multi-node system stores fraudulent data; distribution-based classification determines whether the distributions across nodes differ and classifies the differing distributions. These methods can improve the accuracy of statistical inference in a distributed storage architecture. The feasibility of the proposed methods is verified through simulation studies and real-data examples.
Title (translated from the Chinese): A distributed multi-sample test for massive data
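The abstract does not spell out how the two-node test is computed, so the following is only an illustrative sketch of the general idea, not the authors' scheme. It assumes a coordinator shares a common evaluation grid with both nodes, each node returns only its empirical CDF on that grid plus its sample size, and the coordinator forms a Kolmogorov–Smirnov statistic from those summaries. The function names (`local_ecdf_on_grid`, `kolmogorov_sf`, `two_node_ks`) and the grid-based summary are assumptions made for illustration. Because the supremum is taken over grid points only, the statistic can underestimate the exact two-sample KS distance, and the p-value uses the asymptotic Kolmogorov distribution.

```python
import numpy as np


def local_ecdf_on_grid(sample, grid):
    """Run on a node: return (ECDF values at the grid points, local sample size)."""
    sample = np.sort(np.asarray(sample, dtype=float))
    # side="right" counts the observations that are <= each grid point
    counts = np.searchsorted(sample, grid, side="right")
    return counts / sample.size, sample.size


def kolmogorov_sf(x, terms=100):
    """Asymptotic Kolmogorov survival function 2 * sum_{k>=1} (-1)^(k-1) exp(-2 k^2 x^2)."""
    if x < 0.2:
        # The series converges slowly here, and the true value is 1 to within ~1e-12
        return 1.0
    k = np.arange(1, terms + 1)
    p = 2.0 * np.sum((-1.0) ** (k - 1) * np.exp(-2.0 * k**2 * x**2))
    return float(min(max(p, 0.0), 1.0))


def two_node_ks(summary_a, summary_b):
    """Run on the coordinator: grid-based KS statistic and asymptotic p-value."""
    (ecdf_a, n_a), (ecdf_b, n_b) = summary_a, summary_b
    d = float(np.max(np.abs(ecdf_a - ecdf_b)))
    n_eff = n_a * n_b / (n_a + n_b)  # effective sample size for two samples
    return d, kolmogorov_sf(np.sqrt(n_eff) * d)


# Hypothetical usage: two nodes holding samples that never leave the node
rng = np.random.default_rng(0)
grid = np.linspace(-4.0, 6.0, 201)  # shared grid chosen by the coordinator (assumption)
node_a = local_ecdf_on_grid(rng.normal(0.0, 1.0, 5000), grid)
node_b = local_ecdf_on_grid(rng.normal(0.3, 1.0, 4000), grid)
d_stat, p_value = two_node_ks(node_a, node_b)
print(f"D = {d_stat:.4f}, p = {p_value:.3g}")
```

In this sketch each node transmits only as many numbers as there are grid points, so the communication cost does not grow with the local sample size, and raw observations are never shared. A finer grid brings the statistic closer to the exact KS distance at the price of a larger message.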