Improving estimates of neighborhood change with constant tract boundaries
Introduction
Social scientists routinely rely on methods of interpolation to adjust available data to their research needs. Spatial data from different sources are often based on different geographies that must be reconciled, and some boundaries (e.g., administrative or political boundaries) change frequently. This study calls attention to the potential for substantial error in efforts to harmonize data to constant boundaries using standard approaches of areal and population interpolation. The case in point is census tract boundaries in the United States, which are redefined before every decennial census. We assess the accuracy of standard methods for harmonizing such data over time as boundaries change. Previous research by Logan et al. (2016) took advantage of the public release of population counts from the 2000 Census using 2010 boundaries, showing that estimates using current methods from the Geographic Information Systems (GIS) toolkit were close to the true values in most tracts. We confirm that finding here, drawing on the original confidential 2000 data in a Federal Statistical Research Data Center (FSRDC). We also compare the “true” and estimated values of a selected set of other population characteristics. First, we find that there is much more error in these estimates than in estimates of simple population counts. Second, an alternative approach that injects random noise into the true values, so that they can be publicly disclosed, provides considerably more accurate estimates. Third, there are identifiable conditions under which the interpolation-based tract estimate is more prone to error and therefore requires closer attention by researchers.
Section snippets
Geographical approaches to adjusting boundaries
Geographers have devoted much attention to the effect of discrepancies in the boundaries of the areal units used for the analysis of spatial data. A typical situation is when data are being drawn from different sources. For example, population data may be reported in census tracts, while crime data may be reported in police precincts, or election data in voting districts, or school data in school attendance zones. Another situation, the one we tackle here, is when there are changes over time in
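As a point of reference for the interpolation methods discussed here, the following is a minimal sketch of simple area-weighted interpolation, in which each source zone's count is allocated to target zones in proportion to shared land area. It is an illustration only, not the procedure used by any particular data product; the zone labels, areas, and counts are hypothetical, and the function name `areal_interpolate` is our own.

```python
def areal_interpolate(source_counts, overlap_area, source_area):
    """Allocate source-zone counts to target zones by share of land area.

    source_counts: {source_id: count}
    overlap_area:  {(source_id, target_id): area of the intersection}
    source_area:   {source_id: total area of the source zone}
    Assumes population is spread evenly within each source zone.
    """
    target = {}
    for (src, tgt), area in overlap_area.items():
        share = area / source_area[src]  # fraction of the source zone inside the target
        target[tgt] = target.get(tgt, 0.0) + source_counts[src] * share
    return target


# Hypothetical example: source tract A is split between target tracts X and Y,
# while source tract B falls entirely inside Y.
source_counts = {"A": 1000, "B": 400}
source_area = {"A": 10.0, "B": 4.0}
overlap_area = {("A", "X"): 7.5, ("A", "Y"): 2.5, ("B", "Y"): 4.0}

estimates = areal_interpolate(source_counts, overlap_area, source_area)
print(estimates)
```

The even-spread assumption is the weak point: when residents are concentrated in one part of a source zone, the area share misallocates them, and the misallocation can differ across subgroups.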
Interpolation to harmonize census tract boundaries over time
We consider here the specific case of harmonizing data for census tracts in the United States over time, and we point to a critical source of error in the available interpolated estimates. The earliest national source was the Neighborhood Change Data Base (NCDB, originally developed by the Urban Institute), which first became available in 2002 and was quickly adopted by most social scientists studying tract data in the 1970–2000 period. Its great attraction was that for the first time it promised
Assessing interpolation estimates and a new alternative
We now turn to an effort to gauge the quality of the estimates provided by the LTDB. Records held in a Federal Statistical Research Data Center (FSRDC) allow us to determine the 2010 tract area where persons and households lived when enumerated in the 2000 Census, for either the short form (intended to cover the full population) or the long form (covering one in six households) samples. We can then aggregate these 2000 census records within 2010 geography to provide the best, unbiased estimate of the
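To make the idea of noise injection concrete, the following is a generic sketch of adding zero-mean Laplace noise to a true count before release. It is not the authors' procedure, which is not specified in this excerpt; the values of `epsilon` and `sensitivity`, the count of 3250, and the function names are all assumptions made for illustration.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise.

    The difference of two independent exponential draws with mean `scale`
    follows a Laplace distribution centered at zero.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, rng, sensitivity=1.0):
    """Return the true count plus Laplace noise with scale sensitivity / epsilon."""
    return true_count + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)   # fixed seed so the illustration is reproducible
true_count = 3250        # hypothetical tract count

# Repeated draws show that the noise is centered on the true value.
released = [noisy_count(true_count, epsilon=1.0, rng=rng) for _ in range(10000)]
mean_released = sum(released) / len(released)
print(round(mean_released, 2))
```

Because the noise has mean zero and a scale that does not depend on how tract boundaries changed, the released value stays centered on the true count, unlike an interpolated estimate.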
Research design
This study includes all populated census tracts in 2000 and 2010 in the continental United States. These tracts can be categorized according to how their 2000 and 2010 boundaries compare. We treat as “unchanged” those cases where the difference in boundaries between a tract in year 1 and year 2 involves less than 1 percent of the land area of the year 2 tract. There are three main categories of changes: consolidations, splits, and complex changes. Consolidation is when several 2000 tracts are
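The 1 percent rule for "unchanged" tracts can be written as a small check. This is a sketch under one assumption: that the "difference in boundaries" is measured as the land area of the year 2 tract that is not shared with its matched year 1 tract. The function name and the example areas are hypothetical.

```python
def is_unchanged(differing_area, year2_tract_area, threshold=0.01):
    """Return True when the boundary difference is under 1 percent of the year 2 tract's land area.

    differing_area:   land area affected by the boundary difference (assumed measure)
    year2_tract_area: total land area of the year 2 tract, in the same units
    """
    if year2_tract_area <= 0:
        raise ValueError("year2_tract_area must be positive")
    return differing_area / year2_tract_area < threshold


# Hypothetical tracts, areas in square kilometers.
print(is_unchanged(0.004, 1.2))  # about 0.3 percent of the tract
print(is_unchanged(0.300, 1.2))  # 25 percent of the tract
```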
Errors in LTDB and DP estimates
Table 2 summarizes the level of errors in the LTDB and differential privacy (DP) estimates in terms of root mean squared error (RMSE) for every type of tract and for all six population variables. The key finding with respect to the purpose of this study is that the LTDB estimates of total population are much better than estimates of the under 18, non-Hispanic white, college-educated, and homeowner populations. This differential is small for unchanged and consolidated tracts. Here no interpolation was conducted for the LTDB; data were taken
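For readers unfamiliar with the error measure, the following sketch computes a plain root mean squared error between estimated and true tract values. Whether the published figures are normalized or weighted in some way is not stated in this excerpt, so the unnormalized form here is an assumption, and the three tract values are invented for illustration.

```python
import math


def rmse(estimates, true_values):
    """Root mean squared error between paired estimates and true values."""
    if len(estimates) != len(true_values) or not estimates:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(e - t) ** 2 for e, t in zip(estimates, true_values)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical counts for three tracts.
true_values = [100, 200, 300]
interpolated = [110, 190, 300]   # errors of +10, -10, 0
noise_infused = [101, 199, 300]  # errors of +1, -1, 0

print(rmse(interpolated, true_values))
print(rmse(noise_infused, true_values))
```

Squaring the errors means a few badly estimated tracts can dominate the summary, which is why results are more informative when broken out by type of boundary change.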
Conclusion
The results are clear. Although interpolated estimates of tract “total population” are very reliable, there is even less error in the DP estimates. For other demographic characteristics, interpolation introduces considerable error, while the DP estimates are generally very close to the true values. How great is the problem? In a substantial share of cases for tracts with complex boundary changes, the LTDB estimates differ from the true value by five or ten percent or more.
Fortunately, though
Declaration of interest
No potential conflict of interest was reported by the authors.
Data availability statement
The publicly available data analyzed here are available at this website: https://s4.ad.brown.edu/Projects/Diversity/Researcher/Bridging.htm
Acknowledgments
This research was supported by the Sociology Program of the National Science Foundation (grant 1756567). The Population Studies and Training Center at Brown University (P2CHD041020) provided general support. We thank John Friedman (Brown University) and Adam Smith (Boston University) for assistance with differential privacy methods. All results have been reviewed by the U.S. Census Bureau to ensure that no confidential information is disclosed (approval number CBDRB-FY20-208). Any opinions and
References (24)
- et al. Using areal interpolation methods in geographic information systems. Papers in Regional Science (1991)
- The overlaid network algorithms for areal interpolation problem. Computers, Environment and Urban Systems (1995)
- et al. The generation of spatial population distributions from census centroid data. Environment & Planning A (1989)
- et al. Linkage of the 1981 and 1991 UK censuses using surface modelling concepts. Environment & Planning A (1995)
- et al. A practical method to reduce privacy loss when disclosing statistics based on small samples. NBER Working Paper 25626 (2019)
- et al. Calibrating noise to sensitivity in private data analysis
- et al. Dasymetric mapping and areal interpolation: Implementation and evaluation. Cartography and Geographic Information Science (2001)
- et al. A framework for the areal interpolation of socioeconomic data. Environment & Planning A (1993)
- et al. Areal interpolation: A variant of the traditional spatial problem. Geo-Processing (1980)
- A geostatistical framework for area-to-point spatial interpolation. Geographical Analysis (2004)
- Geostatistical prediction and simulation of point values from areal data. Geographical Analysis
- Validating population estimates for harmonized census tract data, 2000-2010. Annals of the Association of American Geographers