Clustering based on Kolmogorov–Smirnov statistic with application to bank card transaction data, The Journal of the Royal Statistical Society: Series C (Applied Statistics)



The Journal of the Royal Statistical Society: Series C (Applied Statistics) (IF 1.0), Pub Date: 2021-03-25, DOI: 10.1111/rssc.12471
Yingqiu Zhu₁, Qiong Deng₁, Danyang Huang₁, Bingyi Jing₂, Bo Zhang₁


Rapid developments in third-party online payment platforms now make it possible to record massive bank card transaction data. Clustering on such transaction data is of great importance for the analysis of merchant behaviours. However, traditional methods based on generated features inevitably lead to much loss of information. To make better use of bank card transaction data, this study investigates the possibility of using the empirical cumulative distribution of transaction amounts. As the distance between two merchants can be measured using the two-sample Kolmogorov–Smirnov test statistic, we propose the Kolmogorov–Smirnov K-means clustering approach based on this distance measure. An approximation step is conducted to ensure the feasibility of the proposed method even for large-scale transaction data, and the associated theoretical properties are investigated. Both simulations and an empirical study demonstrate that our method outperforms feature-based methods and is computationally efficient for large-scale data sets.
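The idea in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not specify the approximation step or the centroid update, so the code below makes its own assumptions. Each merchant's empirical CDF is evaluated on a fixed grid of transaction amounts, a centroid is the pointwise mean of its members' curves, and initialization picks the farthest curves in turn. The names `ks_statistic`, `ecdf_on_grid`, `sup_dist`, and `ks_kmeans` are hypothetical.

```python
import numpy as np


def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic: sup |F_x(t) - F_y(t)|."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    # The supremum of the ECDF difference is attained at a pooled sample point.
    pooled = np.concatenate([x, y])
    cdf_x = np.searchsorted(x, pooled, side="right") / x.size
    cdf_y = np.searchsorted(y, pooled, side="right") / y.size
    return float(np.max(np.abs(cdf_x - cdf_y)))


def ecdf_on_grid(sample, grid):
    """Empirical CDF of `sample` evaluated at each point of `grid`."""
    sample = np.sort(np.asarray(sample, dtype=float))
    return np.searchsorted(sample, grid, side="right") / sample.size


def sup_dist(a, b):
    """Sup-norm distance between every row of `a` and every row of `b`."""
    return np.abs(a[:, None, :] - b[None, :, :]).max(axis=2)


def ks_kmeans(samples, k, grid, n_iter=20):
    """K-means-style clustering of samples under a gridded KS distance.

    Assumption: ECDFs are compared only at the points in `grid`, so the
    distance is an approximation of the exact KS statistic.
    """
    curves = np.stack([ecdf_on_grid(s, grid) for s in samples])

    # Assumed initialization: start from the first curve, then repeatedly
    # add the curve farthest from the centroids chosen so far.
    chosen = [0]
    while len(chosen) < k:
        d = sup_dist(curves, curves[chosen]).min(axis=1)
        chosen.append(int(d.argmax()))
    centroids = curves[chosen].copy()

    for _ in range(n_iter):
        # Assign each merchant to the nearest centroid in sup-norm.
        labels = sup_dist(curves, centroids).argmin(axis=1)
        new_centroids = centroids.copy()
        for j in range(k):
            members = curves[labels == j]
            if len(members):  # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

For example, `ks_statistic([1, 2, 3], [4, 5, 6])` returns `1.0` because the two samples do not overlap, while identical samples give `0.0`. A reasonable choice for `grid` is a set of quantiles of the pooled transaction amounts, which keeps the cost per merchant proportional to the grid size rather than to the number of transactions.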


Updated: 2021-03-25
