Assessing the accuracy of record linkages with Markov chain based Monte Carlo simulation approach
Journal of Big Data (IF 8.1), Pub Date: 2021-01-06, DOI: 10.1186/s40537-020-00394-7
Shovanur Haque, Kerrie Mengersen, Steven Stern

Record linkage is the process of identifying matching records across different data sources and linking them, so that the linked records refer to the same entity. Statistical, health, government and business organisations increasingly use record linkage to combine administrative, survey, population census and other files into a more complete set of information for comprehensive analysis. Valid inference from a linked file depends on effective and efficient methods for linking data from different sources. It is therefore necessary to assess whether a linking method achieves high accuracy, and to compare methods with respect to accuracy. This motivates a method for assessing the linking process and for deciding which linking method is likely to be more accurate for a particular linking task. This paper proposes a Markov chain based Monte Carlo simulation approach, MaCSim, for assessing a linking method, and illustrates its utility using a realistic synthetic dataset received from the Australian Bureau of Statistics, which avoids the privacy issues associated with using real personal information. A linking method applied by MaCSim is also defined. To assess the defined linking method, the proportion of correct re-links for each record is calculated using the developed simulation approach. The accuracy is determined for a number of simulated datasets. The analyses indicate promising performance of MaCSim for assessing the accuracy of linkages. The computational aspects of the methodology are also investigated to assess its feasibility for practical use.
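The abstract does not spell out the MaCSim algorithm, so the following is only a minimal illustrative sketch of the general idea of estimating per-record correct re-link proportions by simulation. It is not the authors' method. Everything in it is an assumption made for illustration: the function name `relink_proportions`, the parameters `p_break` and `p_fix`, the two-state (intact/corrupted) Markov chain on each field, the agreement-count linking rule, and the convention that record `i` in one file truly matches record `i` in the other.

```python
import random


def relink_proportions(file_a, file_b, p_break, p_fix, n_steps, seed=0):
    """Estimate, for each record, how often it is re-linked correctly.

    Assumptions (hypothetical, not taken from the paper):
    - file_a[i] and file_b[i] describe the same entity (true link i <-> i).
    - Each field of file_b follows a two-state Markov chain:
      intact -> corrupted with probability p_break,
      corrupted -> intact with probability p_fix.
    - A record in file_a is linked to the file_b record with the most
      agreeing intact fields; ties count as an incorrect link.
    """
    rng = random.Random(seed)
    n, k = len(file_b), len(file_b[0])
    corrupted = [[False] * k for _ in range(n)]
    correct = [0] * len(file_a)

    for _ in range(n_steps):
        # One Markov chain transition for every field of every record.
        for j in range(n):
            for f in range(k):
                u = rng.random()
                if corrupted[j][f]:
                    if u < p_fix:
                        corrupted[j][f] = False
                elif u < p_break:
                    corrupted[j][f] = True

        # Re-link every record of file_a against the current state of file_b.
        for i, rec_a in enumerate(file_a):
            scores = [
                sum(
                    1
                    for f in range(k)
                    if not corrupted[j][f] and rec_a[f] == file_b[j][f]
                )
                for j in range(n)
            ]
            best = max(scores)
            if scores[i] == best and scores.count(best) == 1:
                correct[i] += 1

    return [c / n_steps for c in correct]


if __name__ == "__main__":
    # Toy records (name, birth year, state); purely made-up values.
    records = [("ann", "1980", "qld"), ("bob", "1975", "nsw"), ("cy", "1990", "vic")]
    print(relink_proportions(records, records, p_break=0.3, p_fix=0.5, n_steps=200))
```

In this sketch, a proportion close to 1 for a record suggests that the assumed linking rule recovers its true match under most simulated perturbations, while a low proportion flags records that are easily mislinked. A real assessment would replace the toy perturbation model and linking rule with the ones under study.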




Updated: 2021-01-07