Large Scale Record Linkage in the Presence of Missing Data

Ranbaduge, Thilina; Christen, Peter; Schnell, Rainer

Abstract: Record linkage aims to accurately and efficiently identify records that represent the same entity within or across disparate databases. It is a fundamental task in data integration and is increasingly required for accurate decision making in application domains ranging from health analytics to national security. Traditional record linkage techniques calculate string similarities between quasi-identifying (QID) values, such as the names and addresses of people. Errors, variations, and missing QID values can, however, lead to low linkage quality because the similarities between records cannot be calculated accurately. To overcome this challenge, we propose a novel technique that can accurately link records even when QID values contain errors or variations, or are missing. We first generate attribute signatures (concatenated QID values) using an Apriori-based selection of suitable QID attributes, and then relational signatures that encapsulate relationship information between records. Combined, these signatures can uniquely identify individual records and facilitate fast and high-quality linking of very large databases through accurate similarity calculations between records. We evaluate the linkage quality and scalability of our approach using large real-world databases, showing that it can achieve high linkage quality even when the databases being linked contain substantial amounts of missing values and errors.
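
As a rough illustration of the attribute-signature idea described in the abstract, the following minimal Python sketch concatenates QID values for several attribute combinations and skips any combination in which a value is missing, so that a record with gaps can still be matched through its remaining signatures. It is not the authors' implementation: the function names, the example attribute combinations, and the sample records are all hypothetical, the combinations are supplied by hand rather than chosen by the Apriori-based selection, and relational signatures are not covered.

```python
from collections import defaultdict


def normalise(value):
    """Lower-case and trim a QID value; treat None as an empty (missing) value."""
    if value is None:
        return ""
    return str(value).strip().lower()


def attribute_signatures(record, attr_combos):
    """Build one signature per attribute combination with no missing values.

    Each signature keeps the combination it came from, so that equal strings
    from different attribute sets are never confused with one another.
    """
    signatures = set()
    for combo in attr_combos:
        values = [normalise(record.get(attr)) for attr in combo]
        if all(values):  # skip combinations that contain a missing value
            signatures.add((combo, "|".join(values)))
    return signatures


def candidate_pairs(db_a, db_b, attr_combos):
    """Return pairs of record ids that share at least one attribute signature."""
    index = defaultdict(list)
    for rid_a, rec in db_a.items():
        for sig in attribute_signatures(rec, attr_combos):
            index[sig].append(rid_a)

    pairs = set()
    for rid_b, rec in db_b.items():
        for sig in attribute_signatures(rec, attr_combos):
            for rid_a in index.get(sig, []):
                pairs.add((rid_a, rid_b))
    return pairs


# Hypothetical attribute combinations (assumed here, not selected by Apriori).
combos = [("first", "last"), ("first", "city", "year")]

# Hypothetical records with missing values (None).
db_a = {"a1": {"first": "Anna", "last": "Smith", "city": None, "year": "1980"}}
db_b = {
    "b1": {"first": "anna ", "last": "SMITH", "city": "Leeds", "year": None},
    "b2": {"first": "Bob", "last": "Smith", "city": "Leeds", "year": "1980"},
}

pairs = candidate_pairs(db_a, db_b, combos)
print(pairs)
```

In this toy example, records a1 and b1 each lack a value that the second combination needs, yet they still share the signature built from first and last name, so they are paired. Exact signature matching like this does not by itself handle errors or variations in QID values; the abstract attributes that robustness to the combination with relational signatures and similarity calculations.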

Comments:	9 pages
Subjects:	Databases (cs.DB); Information Retrieval (cs.IR)
Cite as:	arXiv:2104.09677 [cs.DB]
	(or arXiv:2104.09677v1 [cs.DB] for this version)
	https://doi.org/10.48550/arXiv.2104.09677
