A simple two-step procedure using the Fellegi–Sunter model for frequency-based record linkage
Journal of Applied Statistics (IF 1.5), Pub Date: 2021-05-04, DOI: 10.1080/02664763.2021.1922615
Huiping Xu, Xiaochun Li, Shaun Grannis

The widely used Fellegi–Sunter model for probabilistic record linkage does not use the information contained in field values, so it classifies match status identically whether records agree on rare or common values. Since agreement on rare values is less likely to occur by chance than agreement on common values, records agreeing on rare values are more likely to be matches. Existing frequency-based methods typically rely on knowledge of the error probabilities associated with field values and of the frequencies of agreed field values among matches, often derived from prior studies or training data. When such information is unavailable, these methods are difficult to apply. In this paper, we propose a simple two-step procedure for frequency-based matching using the Fellegi–Sunter framework to overcome these challenges. Matching weights are adjusted based on the frequency distributions of the agreed field values among matches and non-matches, which are estimated by the Fellegi–Sunter model without relying on prior studies or training data. Through a real-world application and simulation, our method is found to perform comparably to or better than the unadjusted method. Furthermore, frequency-based matching improves matching accuracy most when the matching fields discriminate poorly, and the benefit diminishes as the discriminating power of the matching fields increases.
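The idea of adjusting a matching weight by value frequency can be illustrated with a minimal Python sketch. This is not the authors' estimator: the function names, the decomposition used, and all numbers are assumptions made for illustration. It takes the standard Fellegi–Sunter agreement weight log2(m/u), where m and u are the probabilities that a field agrees among matches and among non-matches, and multiplies each by the (assumed known) probability that the agreed value is a particular value v within that class.

```python
import math


def agreement_weight(m: float, u: float) -> float:
    """Unadjusted Fellegi-Sunter agreement weight for one field: log2(m / u)."""
    return math.log2(m / u)


def frequency_adjusted_weight(
    m: float, u: float, p_value_match: float, p_value_nonmatch: float
) -> float:
    """Illustrative value-specific agreement weight (an assumption, not the paper's formula).

    p_value_match:    P(agreed value is v | field agrees, pair is a match)
    p_value_nonmatch: P(agreed value is v | field agrees, pair is a non-match)
    """
    return math.log2((m * p_value_match) / (u * p_value_nonmatch))


# Hypothetical probabilities for a surname field.
m, u = 0.95, 0.01

# Among agreeing non-matches, common values are over-represented relative to
# agreeing matches, so a common value gets a lower weight and a rare one a higher weight.
common = frequency_adjusted_weight(m, u, p_value_match=0.02, p_value_nonmatch=0.20)
rare = frequency_adjusted_weight(m, u, p_value_match=0.001, p_value_nonmatch=0.00001)
unadjusted = agreement_weight(m, u)

print(f"unadjusted={unadjusted:.2f} common={common:.2f} rare={rare:.2f}")
```

When the two value probabilities are equal, the adjusted weight reduces to the unadjusted one, which matches the abstract's point that the unadjusted model treats agreement on rare and common values identically.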




Updated: 2021-05-04