Combining family history and machine learning to link historical records: The Census Tree data set

https://doi.org/10.1016/j.eeh.2021.101391

Abstract

A key challenge for research on many questions in the social sciences is that it is difficult to link records in a way that allows investigators to observe people at different points in their life or across generations. In this paper, we contribute to recent efforts to create these links with a new approach that relies on millions of record links created by individual contributors to a large, public, wiki-style family tree. We use these “true” links both to inform the decisions one needs to make when using automated methods to link records and as a training data set for use in a supervised machine learning approach. We describe our procedure and illustrate its potential by linking individuals across the 100% samples of the US censuses from 1900, 1910, and 1920. When linking adjacent censuses, we obtain an overall match rate of 62-65 percent (for over 88.9 million matches), with a false positive rate that is around 6-7 percent and with links that are similar to the population along observable characteristics. Thus, our method allows us to link records with a combination of a high match rate, precision, and representativeness that is beyond the current frontier. Finally, we demonstrate the potential of the data by estimating the degree of intergenerational transmission of literacy between father-son and mother-daughter pairs.

Introduction

For many of the most pressing questions in the social sciences, empirical analysis relies on access to data that allow the researcher to observe people at different points in their life or across generations. For example, to measure the intergenerational transmission of socio-economic status, we need to be able to link a parent to his or her adult child; to estimate the long-term impacts of childhood experiences, we typically need to observe a person as both a child and as an adult. Unfortunately, this type of data has been hard to come by in the United States, since census data and many administrative data sets lack a consistent individual registration number (as exists, for example, in Sweden and Norway).

Recently, researchers studying the U.S. have created large linked samples in three ways. One approach is to employ restricted-use data with unique individual identifiers that permit linking. This includes work that uses Social Security numbers to link tax records across generations (Chetty and Hendren, 2018), to education histories (Chetty et al., 2017), or to survey data (Mazumder and Davis, 2013). Another strategy is to link individuals across records by matching on characteristics such as the person's name, birth year, and birthplace (Abramitzky et al., 2014; Evans et al., 2016; Ferrie, 1996; Abramitzky, Mill, and Pérez, 2018). A third approach is to use supervised machine learning algorithms. Intuitively, machine learning methods use a training set with examples of both correct and incorrect matches to “learn” which features of the data best predict a correct match. This information can be used to create an algorithm to identify new matches (Christen, 2012; Feigenbaum, 2016; Folkman et al., 2018). Each of these approaches has its advantages and disadvantages, and they are likely to complement each other in a combined effort to link as many individuals across historical records as possible.
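To make the second strategy concrete, the following minimal sketch links records on exact name and birthplace with a birth-year tolerance and keeps only unambiguous matches. The record fields, the two-year tolerance, and the function name `link_unique` are illustrative assumptions; they are not the rules used in any of the papers cited above.

```python
def link_unique(records_a, records_b, year_tolerance=2):
    """Link A to B on exact name and birthplace, with birth years within
    `year_tolerance`, keeping only one-to-one matches."""
    # Index B by (name, birthplace) so only plausible candidates are compared.
    index = {}
    for rec in records_b:
        index.setdefault((rec["name"], rec["birthplace"]), []).append(rec)

    candidates = {}  # A id -> B id, for A records with exactly one candidate
    claims = {}      # B id -> number of A records pointing at it
    for rec in records_a:
        matches = [
            cand
            for cand in index.get((rec["name"], rec["birthplace"]), [])
            if abs(cand["birth_year"] - rec["birth_year"]) <= year_tolerance
        ]
        if len(matches) == 1:
            b_id = matches[0]["id"]
            candidates[rec["id"]] = b_id
            claims[b_id] = claims.get(b_id, 0) + 1

    # Drop links where two A records claim the same B record.
    return {a: b for a, b in candidates.items() if claims[b] == 1}


# Hypothetical toy records, invented for illustration only.
census_1900 = [
    {"id": "a1", "name": "john smith", "birth_year": 1880, "birthplace": "Ohio"},
    {"id": "a2", "name": "mary jones", "birth_year": 1875, "birthplace": "Utah"},
    {"id": "a3", "name": "john smith", "birth_year": 1881, "birthplace": "Ohio"},
]
census_1910 = [
    {"id": "b1", "name": "john smith", "birth_year": 1880, "birthplace": "Ohio"},
    {"id": "b2", "name": "mary jones", "birth_year": 1876, "birthplace": "Utah"},
]
links = link_unique(census_1900, census_1910)
```

In this toy example the two John Smiths both point at the same 1910 record, so neither is linked. Common names therefore lower the match rate under rules of this kind.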

In this paper, we propose a new approach for linking individuals across historical records that builds on and incorporates many of these other methods. Our unique contribution arises from our use of a data set created from decisions made by millions of people doing research on their own family histories. These researchers often gather source documents (including census records) to establish various life events and relationships for a family member, then post their conclusions on genealogical websites like Ancestry, FamilySearch, FindMyPast, MyHeritage, Geni, and Wikitree. The key feature we exploit is that when the profile for a deceased individual on one of these websites has multiple sources attached, each pair of these sources establishes a “true” match. These matches can be used to inform the decisions made when employing various linking strategies and can also be used as training data for supervised machine learning methods. The data are highly reliable, as many of the links are created by family members who have a personal interest in making a correct match. Furthermore, these family members often have private information that can be used to identify the person of interest across multiple data sets, such as maiden names or the names of other household members.
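The idea that each pair of sources attached to one profile yields a “true” match can be sketched as follows. The input layout (a mapping from a profile identifier to a list of `(census_year, record_id)` tuples) and all identifiers are hypothetical, chosen only for illustration; this is not FamilySearch's actual data model.

```python
from itertools import combinations


def true_links(profile_sources):
    """Return the set of (earlier_record, later_record) pairs implied by
    census records attached to the same tree profile."""
    links = set()
    for sources in profile_sources.values():
        # Sorting the tuples orders each pair by census year, earliest first.
        for earlier, later in combinations(sorted(set(sources)), 2):
            if earlier[0] != later[0]:  # skip two records from the same census
                links.add((earlier, later))
    return links


# Hypothetical profiles with attached census records.
profiles = {
    "P1": [(1900, "r17"), (1910, "r42"), (1920, "r99")],
    "P2": [(1910, "r03"), (1900, "r08")],
    "P3": [(1920, "r55")],
}
links = true_links(profiles)
```

A profile with three attached censuses contributes three pairs (1900–1910, 1910–1920, and 1900–1920), while a profile with a single attached record contributes none.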

The genealogy platform we use for our study is FamilySearch. FamilySearch has created a large, public, wiki-style family tree that includes profiles for over 1.2 billion deceased individuals and has over 12.6 million registered users who can contribute information to those profiles. Individuals can upload information and sources to the profiles of their own ancestors and relatives and can make edits to the information and sources attached by other contributors working on the same people. In addition, FamilySearch provides regular record hints as suggestions to these contributors, who then decide whether or not the source should be attached to that person. We use a sample of individuals from this family tree who are attached to at least two census records between 1900 and 1920. This provides a data set with 14.5 million 1900–1910 links, 16.3 million 1910–1920 links, and 9.8 million 1900–1920 links.

We describe a process that uses these data to create millions more links among these three censuses. First, the FamilySearch data allow us to examine several important decisions that need to be made when using automated methods to link historical records. These decisions include how to pre-process the data, which features to use to identify potential matches, and which machine learning algorithm to use. We show how key properties of the data—precision, recall, and representativeness—respond to these choices, and we demonstrate the potential for transfer learning with our algorithm. Second, we use the FamilySearch links as training data for a supervised machine learning algorithm and combine the links we get from this machine learning approach with other methods to link records. Our final data set, which we call the “Census Tree” data, contains 61.6% of the potential matches between the 1900 and 1910 full-count US censuses, and 65.2% of the potential matches between the 1910 and 1920 full-count US censuses (or 38.8 and 50.1 million matches, respectively).
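As a rough illustration of how precision and recall can be measured against a set of known links, the sketch below uses the standard definitions: precision is the share of evaluated predictions that are correct, and recall is the share of known links that are recovered. The function name, the dictionary layout, and the toy identifiers are assumptions; the excerpt does not specify the authors' exact evaluation procedure.

```python
def precision_recall(predicted, truth):
    """Both arguments map a record id in the earlier census to a record id
    in the later census. Only records with a known true link are scored."""
    evaluated = {a: b for a, b in predicted.items() if a in truth}
    correct = sum(1 for a, b in evaluated.items() if truth[a] == b)
    precision = correct / len(evaluated) if evaluated else float("nan")
    recall = correct / len(truth) if truth else float("nan")
    return precision, recall


# Hypothetical held-out true links and model predictions.
truth = {"a1": "b1", "a2": "b2", "a3": "b3", "a4": "b4"}
predicted = {"a1": "b1", "a2": "b9", "a3": "b3", "a5": "b5"}
precision, recall = precision_recall(predicted, truth)
```

Here three predictions can be checked against the truth and two are right, so precision is 2/3. Two of the four known links are recovered, so recall is 1/2. The prediction for `a5` is ignored because no true link is known for it.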

In Section V of the paper, we summarize the properties of the Census Tree data set and compare it to other linking methods and efforts. First, we show that people who are linked to a prior census in the Census Tree are similar to the full population in terms of gender, age, household characteristics, and occupation score, but that white Americans and those who live in their birth state are over-represented. We also hand-check a random sample of our predicted matches and use a transitivity test to show that the false positive rate among our predicted matches is about 7%. Prior work has documented a “production possibilities frontier” that shows the tradeoff between recall (finding more matches) and precision (avoiding false positives) (Abramitzky et al., 2019); we show that when combining our methods with those of others, we achieve a combination of these two qualities that is beyond the frontier achieved by the groundbreaking Census Linking Project, or CLP (Abramitzky et al., 2020). Moreover, our method adds tens of millions of new links among these censuses, including many links among women (who are excluded entirely from the CLP) and minority groups.
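One plausible reading of a transitivity test is sketched below: if a 1900 record links to a 1910 record, and that 1910 record links to a 1920 record, then the direct 1900–1920 link should point to the same 1920 record. This excerpt does not describe the authors' exact procedure, so the function and the toy link tables are assumptions.

```python
def transitivity_check(links_00_10, links_10_20, links_00_20):
    """Count 1900 records whose two-step path (1900 -> 1910 -> 1920)
    agrees or disagrees with the direct 1900 -> 1920 link."""
    agree = disagree = 0
    for rec_1900, rec_1910 in links_00_10.items():
        via = links_10_20.get(rec_1910)
        direct = links_00_20.get(rec_1900)
        if via is None or direct is None:
            continue  # the check needs all three links to exist
        if via == direct:
            agree += 1
        else:
            disagree += 1
    return agree, disagree


# Hypothetical link tables between the three censuses.
links_00_10 = {"a1": "b1", "a2": "b2", "a3": "b3"}
links_10_20 = {"b1": "c1", "b2": "c2"}
links_00_20 = {"a1": "c1", "a2": "c7", "a3": "c3"}
agree, disagree = transitivity_check(links_00_10, links_10_20, links_00_20)
```

A disagreement means at least one of the three links is wrong, so the disagreement share gives indirect evidence about the false positive rate without requiring hand-checking.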

As an application to demonstrate the potential of the data, we produce estimates of the intergenerational transmission of literacy between parents in the 1900 Census and their adult children in the 1920 Census. We are able to do this separately for Black and white Americans, and for both father-son and mother-daughter pairs. We show that our estimation samples are larger than those currently available from the CLP and are more representative of the full population than either the FamilySearch links alone or the CLP links. The estimates suggest that the greater precision of the Census Tree links works to mitigate the well-known problem of attenuation bias from measurement error due to incorrect links (Solon, 1992).
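The attenuation argument can be illustrated with a small simulation. With a binary regressor, the OLS slope equals the difference in mean child literacy between children of literate and illiterate parents. If a share of links are false, and a falsely linked "parent" is an unrelated adult, the estimated slope shrinks toward zero. All probabilities below are invented for illustration and are not estimates from the paper.

```python
import random

P_PARENT_LITERATE = 0.8           # assumed share of literate parents
P_CHILD_IF_PARENT_LITERATE = 0.9  # assumed, implies a true slope of 0.4
P_CHILD_IF_PARENT_ILLITERATE = 0.5


def ols_slope(x, y):
    """OLS slope of y on x with an intercept: cov(x, y) / var(x)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var = sum((xi - mean_x) ** 2 for xi in x)
    return cov / var


def simulate(false_link_rate, n=100_000, seed=0):
    """Estimate the parent-child literacy slope when a share of links
    attach an unrelated adult's literacy to the child."""
    rng = random.Random(seed)
    observed_parent, child = [], []
    for _ in range(n):
        parent = int(rng.random() < P_PARENT_LITERATE)
        p_child = (P_CHILD_IF_PARENT_LITERATE if parent
                   else P_CHILD_IF_PARENT_ILLITERATE)
        child.append(int(rng.random() < p_child))
        if rng.random() < false_link_rate:
            # False link: the observed "parent" is a random unrelated adult.
            observed_parent.append(int(rng.random() < P_PARENT_LITERATE))
        else:
            observed_parent.append(parent)
    return ols_slope(observed_parent, child)


slope_clean = simulate(false_link_rate=0.0)
slope_noisy = simulate(false_link_rate=0.3)
```

Under these assumptions the expected slope is the true slope multiplied by one minus the false link rate, so a sample with fewer false links yields an estimate closer to the true value.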

Ultimately, our goal for this project is to create every possible link among the full-count US decennial censuses from 1850 to 1940, and to make these links available to other researchers. But the method we describe in this paper could be applied to any pair of data sets for which there is a sufficient number of individuals with records in both collections linked by users on a genealogy platform. Moreover, the potential for transfer learning from the census links will likely aid efforts to create new links among these data sets. As a result, the potential of the methods introduced in this paper will grow even beyond this ambitious goal, as the use of genealogy websites like FamilySearch spreads around the globe and as more historical records are digitized.


Background

The 100 percent samples of the US decennial censuses are made available to the public after 72 years, which creates the possibility for linking individuals over long periods of time. Several approaches have been used by social scientists to create large linked samples. These include creating pre-determined rules to identify unique matches (Abramitzky et al., 2014; Alexander and Ward, 2018; Beach et al., 2016; Collins and Wanamaker, 2015; Ferrie, 1996), employing a statistical algorithm such as

Data

We use two sources of data for this project. The first data set is the 100% sample of the US decennial census for 1900, 1910, and 1920. These data provide the raw records that we aim to link together and include characteristics such as the person's name, birth year, birthplace, gender, race, place of residence, and the birthplaces of their father and mother. All of these variables were transcribed (or indexed) from the original digitized images by volunteers recruited by FamilySearch. The data

Method

We begin by describing how we use the Family Tree data as training data for our supervised machine learning methods, and as a resource for making informed decisions about how to pre-process the data, which blocking and matching features to choose, and which machine learning algorithm to use. We then describe the full pipeline that we use to link census records, which includes both supervised and unsupervised methods.
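Blocking and matching features can be sketched as follows. Blocking restricts comparisons to record pairs that agree on a coarse key, and matching features then describe each candidate pair for a classifier. The blocking key (surname initial, birthplace, sex), the three-year birth-year window, and the feature names are hypothetical choices for illustration, not the ones selected in the paper.

```python
def block_key(rec):
    """Coarse key: records in different blocks are never compared."""
    return (rec["surname"][:1].lower(), rec["birthplace"], rec["sex"])


def candidate_pairs(records_a, records_b, max_year_gap=3):
    """Yield (a, b) pairs that share a block and have close birth years."""
    blocks = {}
    for rec in records_b:
        blocks.setdefault(block_key(rec), []).append(rec)
    for a in records_a:
        for b in blocks.get(block_key(a), []):
            if abs(a["birth_year"] - b["birth_year"]) <= max_year_gap:
                yield a, b


def features(a, b):
    """Simple comparison features that a classifier could be trained on."""
    return {
        "surname_exact": int(a["surname"].lower() == b["surname"].lower()),
        "given_exact": int(a["given"].lower() == b["given"].lower()),
        "birth_year_gap": abs(a["birth_year"] - b["birth_year"]),
    }


# Hypothetical toy records.
recs_1900 = [
    {"id": "a1", "given": "Anna", "surname": "Olsen", "sex": "F",
     "birth_year": 1890, "birthplace": "Norway"},
]
recs_1910 = [
    {"id": "b1", "given": "Anna", "surname": "Olson", "sex": "F",
     "birth_year": 1891, "birthplace": "Norway"},
    {"id": "b2", "given": "Anna", "surname": "Olsen", "sex": "F",
     "birth_year": 1870, "birthplace": "Norway"},
    {"id": "b3", "given": "Anna", "surname": "Olsen", "sex": "F",
     "birth_year": 1890, "birthplace": "Sweden"},
]
pairs = [(a["id"], b["id"], features(a, b))
         for a, b in candidate_pairs(recs_1900, recs_1910)]
```

Only `b1` survives as a candidate: `b2` falls outside the birth-year window and `b3` sits in a different block. The surviving pair still differs in surname spelling, which is the kind of case a trained classifier, rather than an exact-match rule, is meant to resolve.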

Results

We now apply this process to produce a linked data set of individuals across the 1900, 1910 and 1920 US censuses. Table 6 reports the number of links we are able to make between adjacent censuses, using each of the strategies described in the previous section. We provide the total number of matches that are obtained from each strategy as well as the number of new matches obtained when we apply the methods in sequence. Focusing on links between the 1910 and 1920 censuses, we see that our XGBoost

Application: The Intergenerational Transmission of Literacy

As a final demonstration of the potential of our data for research, we calculate estimates of the intergenerational transmission of literacy using three different samples—the links taken directly from the Family Tree, links from our full Census Tree (which include the Family Tree links), and links from the CLP. To do this, we estimate OLS regressions where the dependent variable is a dummy variable indicating that the child is literate, and the independent variable is a dummy variable for the

Accessing the data and code for research

We are committed to making the data and methods that we have described in this paper available to other researchers. To link the restricted version of the census to a subset of our training data from the Family Tree, researchers will first need to obtain access to the restricted versions of the complete count censuses for the relevant years. They can then use the data and code that we provide in our Open ICPSR repository to create the links based on our machine learning model (Price and

Conclusion

Recent developments in data access and record linking methodology have created exciting opportunities for social science research using large populations (Gutmann et al., 2018). We contribute to this work by developing novel ways to use data created from the contributions of millions of individuals who are investigating their own family histories on FamilySearch, a genealogy web platform. These researchers often gather records from censuses and other sources and link them together on a family

Declaration of Competing Interest

None.

References (38)

  • D. Costa et al. Data set from the union army samples to study locational choice and social networks. Data Brief (2018)
  • R. Abramitzky et al. A nation of immigrants: Assimilation and economic outcomes in the age of mass migration. J. Polit. Econ. (2014)
  • R. Abramitzky et al. Automated linking of historical data. J. Econ. Lit. (2019)
  • R. Abramitzky et al. Linking individuals across historical sources: a fully automated approach. Historic. Methods (2020)
  • R. Alexander et al. Age at arrival and assimilation during the age of mass migration. J. Econ. History (2018)
  • L. Antoine et al. Selection bias encountered in the systematic linking of historical census records. Soc. Sci. History (2020)
  • M. Bailey et al. How well do automated methods perform in historical samples? Evidence from new ground truth. J. Econ. Lit. (2019)
  • B. Beach et al. Typhoid fever, water quality, and human capital formation. J. Econ. History (2016)
  • Charles, K., T. Eastmond, J. Price, and D. Rees. “Long-run consequences of prejudice.” Working paper....
  • T. Chen et al. XGBoost: a scalable tree boosting system
  • R. Chetty et al. Mobility Report Cards: The Role of Colleges in Intergenerational Mobility (2017)
  • R. Chetty et al. The impacts of neighborhoods on intergenerational mobility I: childhood exposure effects. Q. J. Econ. (2018)
  • P. Christen. Data Matching (2012)
  • W. Collins et al. The great migration in black and white: new evidence on the selection and sorting of southern migrants. J. Econ. History (2015)
  • M. Evans et al. The developmental effect of state alcohol prohibitions at the turn of the twentieth century. Econ. Inq. (2016)
  • Feigenbaum, JJ. “Automated census record linking: a machine learning approach.” Working Paper,...
  • JJ. Feigenbaum. Multiple measures of historical intergenerational mobility: Iowa 1915 to 1940. Econ. J. (2018)
  • J. Ferrie. A new sample of americans linked from the 1850 public use micro sample of the federal census of population to the 1860 federal census manuscript. Historic. Methods (1996)
  • T. Folkman et al. GenERes: a genealogical entity resolution system


    This work has been supported in part by grant #G-1063 from the Russell Sage Foundation. Any opinions expressed are those of the principal investigators alone and should not be construed as representing the opinions of the Foundation. We are grateful for helpful comments and assistance from Ran Abramitzky, Martha Bailey, Katherine Eriksson, James Feigenbaum, Joe Ferrie, Ian Fillmore, Cathy Fitch, Brigham Frandsen, Katie Genadek, Jonas Helgertz, Bob Pollack, Steve Ruggles, and Anne Winkler. This project would not have been possible without the careful and thoughtful work of many research assistants at the Brigham Young University Record Linking Lab and at Notre Dame, including Ben Branchflower, Ben Busath, Alison Doxey, Neil Duzett, Laren Edwards, Brianna Felegi, Nicholas Grasley, Adrian Haws, Brandon Ly, Amanda Marsden, Jalen Morgan, Daniel Sabey, and Joseph Young.
