Combining family history and machine learning to link historical records: The Census Tree data set, Explorations in Economic History


Combining family history and machine learning to link historical records: The Census Tree data set
Explorations in Economic History (IF 1.857), Pub Date: 2021-02-15, DOI: 10.1016/j.eeh.2021.101391
Joseph Price, Kasey Buckles, Jacob Van Leeuwen, Isaac Riley

A key challenge for research on many questions in the social sciences is that it is difficult to link records in a way that allows investigators to observe people at different points in their life or across generations. In this paper, we contribute to recent efforts to create these links with a new approach that relies on millions of record links created by individual contributors to a large, public, wiki-style family tree. We use these “true” links both to inform the decisions one needs to make when using automated methods to link records and as a training data set for use in a supervised machine learning approach. We describe our procedure and illustrate its potential by linking individuals across the 100% samples of the US censuses from 1900, 1910, and 1920. When linking adjacent censuses, we obtain an overall match rate of 62-65 percent (for over 88.9 million matches), with a false positive rate that is around 6-7 percent and with links that are similar to the population along observable characteristics. Thus, our method allows us to link records with a combination of a high match rate, precision, and representativeness that is beyond the current frontier. Finally, we demonstrate the potential of the data by estimating the degree of intergenerational transmission of literacy between father-son and mother-daughter pairs.
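The two headline quantities in the abstract, the match rate and the false positive rate, can be illustrated with a small Python sketch. It is only an illustration under stated assumptions, not the authors' procedure: it assumes the match rate is the share of base-census records that receive a link, and that the false positive rate is the share of predicted links that disagree with a known "true" link for the same person. The function names, record IDs, and toy data are all hypothetical.

```python
def match_rate(predicted_links, n_base_records):
    """Share of base-census records that received a link (assumed definition)."""
    return len(predicted_links) / n_base_records


def false_positive_rate(predicted_links, true_links):
    """Among predicted links that can be checked against a known true link,
    the share that point to a different record (assumed definition)."""
    checkable = [src for src in predicted_links if src in true_links]
    if not checkable:
        raise ValueError("no predicted link overlaps the true links")
    wrong = sum(predicted_links[src] != true_links[src] for src in checkable)
    return wrong / len(checkable)


# Toy data, entirely made up: keys stand for 1900 record IDs,
# values for 1910 record IDs.
predicted = {"a1": "b1", "a2": "b2", "a3": "b9", "a4": "b4"}
truth = {"a1": "b1", "a3": "b3", "a4": "b4", "a5": "b5"}

rate = match_rate(predicted, n_base_records=6)
fpr = false_positive_rate(predicted, truth)
print(f"match rate: {rate:.2f}, false positive rate: {fpr:.2f}")
```

In this toy case four of six base records are linked, and one of the three checkable links disagrees with the true link. Only links whose base record also appears in the true-link set can be checked, so the error estimate rests on that overlapping subset.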


Updated: 2021-03-22
