Main

A large body of scientific evidence documents how environmental exposures can substantially shape human outcomes. For instance, poor air quality is estimated to kill millions of people per year1, warming temperatures lead to more crime and less economic output2, and exposure to lead reduces educational achievement3. Effects can vary substantially across subgroups: air pollution is more harmful to health in poorer US counties4, warming temperatures have more negative effects on economic output in already hot locations5 and lead exposures appear to have larger effects in already disadvantaged households6. Standard models of behaviour in both public health and economics suggest that the magnitude and variation of these effects in part reflect choices that individuals make or are unable to make regarding their exposures, and that the choice sets available to individuals are in turn reflective of individuals’ knowledge, circumstances and preferences7,8. Unfortunately, these decision-making components, as well as their behavioural outcomes, are typically hard to observe at scale. This makes it difficult to understand why a given environmental exposure generates the effect it does, why this effect might differ across groups, and whether and how policy should respond. A lack of data on behaviours and exposures also makes the evaluation of existing policy approaches challenging, which is particularly problematic for the large number of environmental settings—from wildfires to heat waves to hurricanes—in which current policy largely relies on individuals to protect themselves.

Here we show how combining traditional sensor and survey data with information from non-traditional distributed sensors—including data from private outdoor and indoor pollution sensors, cell phones, social media posts and internet search activity—can generate population-scale insights into people’s knowledge, preferences and choices regarding a changing environment, and into how economic circumstances shape their choice set. We focus on understanding responses to wildfire smoke, a rapidly growing environmental stressor throughout much of the United States and internationally. The annual area burned by wildfires in the United States has more than doubled in recent decades, as a result of a century of fire suppression and a warming climate that has left the resulting abundant fuel much more flammable9. This increase in fire activity has led to substantial increases in average smoke exposure across the continental United States, potentially reversing decades of improvements in air quality10. Absent substantial intervention, these trends are expected to continue and perhaps accelerate in a warming climate11,12,13.

A growing literature has begun to document myriad health impacts of ambient wildfire smoke exposure14,15,16,17,18. As with impacts of other environmental stressors, the magnitude of these health impacts may depend on behaviours and individual-specific exposures that are often poorly measured and understood. In particular, recent wildfire case studies suggest that individuals vary in their knowledge and beliefs about their own exposures and about the risks that these exposures pose to their health19,20. Documented heterogeneity in the impacts of wildfire smoke exposure and exposure to other particulates4,18,21,22,23 suggests that both socio-economic circumstances and previous exposures could also constrain behaviour in important ways. Understanding which features matter most is consequential for policy design: impacts driven by a lack of awareness of one’s exposure, for instance, call for different interventions than impacts driven by an inability to protect oneself from a known exposure. Such understanding is particularly important for wildfire, given that current policy approaches to risk mitigation focus on the private provision of protection—that is, asking individuals to stay indoors, limit infiltration and purchase protective technologies24.

To better understand exposures, behavioural responses and outcomes in the face of rapidly changing wildfire risk, we first develop measures of daily exposure to ambient particulate matter (PM2.5) from wildfire smoke, as this exposure itself is not directly measured by existing pollution sensors. To do so, we combine long time series of regulatory ground monitor data on PM2.5 concentrations from US Environmental Protection Agency (EPA) pollution monitors with satellite-derived estimates of smoke exposure. To isolate wildfire-smoke-derived PM2.5 from other sources of PM2.5, we define smoke PM2.5 as location-, month- and period-specific anomalous PM2.5 on days in which satellites indicated that smoke was overhead (Methods). The resulting ambient smoke PM2.5 exposure data cover 782 US counties that contain ~74% of the US population (Extended Data Fig. 1), and these data display wide spatial and temporal variability (Fig. 1). Satellite-based counts of the annual number of dense smoke plumes overhead have trended upwards in the past decade throughout most of the continental United States, particularly in the West (Fig. 1a), which we estimate have helped drive rapid increases in estimated annual smoke PM2.5 across the country and in days with extreme PM2.5 due to smoke (Fig. 1b,c).
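
The smoke PM2.5 definition above can be illustrated with a minimal sketch. The exact baseline is specified in the Methods; here we assume, purely for illustration, that the baseline is the median PM2.5 on non-smoke days within each monitor, month and period cell, and that negative anomalies are floored at zero. The function name `smoke_pm25` and the record keys are hypothetical.

```python
from collections import defaultdict
from statistics import median


def smoke_pm25(records):
    """Return smoke-attributed PM2.5 (ug/m^3) for each record.

    records: list of dicts with keys 'monitor', 'month', 'period',
    'pm25' (ug/m^3) and 'smoke_day' (bool, satellite plume overhead).
    """
    # Baseline (assumption): median PM2.5 on non-smoke days within each
    # monitor x month x period cell.
    cells = defaultdict(list)
    for r in records:
        if not r["smoke_day"]:
            cells[(r["monitor"], r["month"], r["period"])].append(r["pm25"])
    baseline = {key: median(vals) for key, vals in cells.items()}

    out = []
    for r in records:
        key = (r["monitor"], r["month"], r["period"])
        if not r["smoke_day"]:
            out.append(0.0)  # no smoke overhead, so no smoke PM2.5
        elif key in baseline:
            # Anomaly above baseline, floored at zero (assumption).
            out.append(max(r["pm25"] - baseline[key], 0.0))
        else:
            out.append(None)  # no non-smoke baseline available for this cell
    return out
```

Under these assumptions, a smoke day whose PM2.5 falls below the local baseline contributes zero rather than a negative value.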

Fig. 1: Trends in smoke exposure across the United States.

a, Measurements from satellites indicate rapidly growing exposure to ‘heavy’ smoke plumes across much of the United States. The estimates are shown on a 10 km grid63 and indicate the estimated annual increase between 2011 and 2020 in the number of days with smoke plumes that National Oceanic and Atmospheric Administration (NOAA) analysts designate as ‘heavy’, their densest plume classification. The dots indicate EPA ground-based pollution monitors. b, Distribution of estimated annual average smoke PM2.5 across EPA ground monitors reporting in each year. c, Distribution of estimated daily smoke PM2.5 across the same monitors. Recent increases in extreme annual exposure are being driven by increases in extreme daily exposure. Source for map in a: US Census Bureau.

We merge these daily wildfire-smoke-derived data with multiple high-frequency datasets that are measured at the population scale and shed light on individuals’ knowledge, beliefs and behaviours during wildfires. To study individuals’ awareness of their exposure, we first analyse location-specific variation in search query behaviour related to smoke exposure. We use public data on specific search queries (for example, ‘air quality’) from Google Trends, which provides normalized data on search term popularity at the weekly level across ‘designated market areas’ (DMAs) (roughly, metro areas; Methods). We interpret a purposeful search for information related to smoke exposure as evidence that an individual is aware he or she is being exposed and that his or her exposure level is worth learning about—what we call ‘salience’.

Second, we study individuals’ preferences and sentiments regarding wildfire smoke exposure. Such preferences underlie standard theoretical models of choice behaviour but are hard to observe directly and at scale. Past work has shown that social media behaviour can be a sensitive and accurate tool for understanding individuals’ preferences and sentiments towards what is happening around them25, including a changing environment26,27. Following this earlier work, we analyse ~1.7 billion georeferenced Twitter updates (‘tweets’) posted since 2016 using natural language processing algorithms that extract information on the sentiment revealed in each tweet28 (Methods). This approach has been validated at the population scale against self-reported measures of emotional state25 and complements earlier work that used Twitter to directly measure wildfire activity29 and infer smoke concentrations30.

Third, we again use Google search queries to study whether individuals sought information regarding specific health-protective actions, analysing item-specific search terms such as ‘air filter’, ‘air purifier’ and ‘smoke mask’. While we cannot observe whether individuals eventually purchased these items except in the case of PurpleAir monitors (as described below), such search behaviour can be interpreted, at a minimum, as evidence of an individual’s belief that health-protective options exist. Evidence from other settings suggests that search activity is predictive of future behaviour, including consumer purchases31,32.

Fourth, we use smartphone-derived location data to study whether individuals altered their physical movements during periods of smoke exposure. Short-term migration in response to other environmental stress (such as hurricanes) is common and is a plausible avenue by which individuals or households could seek to limit exposure to wildfire smoke. We study both the share of people estimated to be completely at home and the share estimated to be completely away from their homes, on days or weeks of smoke exposure.

We combine each measure with our smoke PM2.5 data and analyse the effect of smoke on each outcome using panel fixed effects estimators that exploit local temporal variation in both exposures and outcomes. While long-term exposure to wildfire smoke shows clear spatial patterns and temporal trends (Fig. 1), local-level variation in daily exposure is highly random, and panel estimators—which are commonly employed in related environmental settings2—plausibly isolate the impact of variation in smoke exposure from other time-invariant and time-varying factors that could be correlated with both smoke exposure and outcomes, including potential confounding from COVID-19 (Methods). To ensure that we are measuring the impact of wildfire smoke and not simply proximity to wildfire itself, we develop measures of distance to the nearest active wildfire and analyse whether responses differ by fire proximity. The unit of observation in these analyses is either the county-day (for sentiment and mobility analyses) or the metro area-week (for salience and health protection measures).
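
To make the panel fixed effects idea concrete, the following is a minimal sketch of a two-way (unit and time) within estimator for a single regressor. It is not the specification used in the paper, which is described in the Methods: it assumes a balanced panel, omits weather controls and standard errors, and the function name `twfe_slope` is hypothetical.

```python
from collections import defaultdict


def twfe_slope(rows):
    """Estimate beta in y = beta*x + unit FE + time FE + error.

    rows: list of (unit, time, x, y) tuples forming a balanced panel
    (every unit observed in every time period).
    """
    def demean(idx):
        # Two-way within transformation: subtract unit and time means and
        # add back the grand mean. Exact only for balanced panels.
        by_unit, by_time = defaultdict(list), defaultdict(list)
        for r in rows:
            by_unit[r[0]].append(r[idx])
            by_time[r[1]].append(r[idx])
        unit_mean = {u: sum(v) / len(v) for u, v in by_unit.items()}
        time_mean = {t: sum(v) / len(v) for t, v in by_time.items()}
        grand = sum(r[idx] for r in rows) / len(rows)
        return [r[idx] - unit_mean[r[0]] - time_mean[r[1]] + grand for r in rows]

    x_t, y_t = demean(2), demean(3)
    # OLS slope on the demeaned data.
    return sum(a * b for a, b in zip(x_t, y_t)) / sum(a * a for a in x_t)
```

Because unit and time means are removed, only within-location deviations in smoke exposure from the common time pattern identify the slope, which is the sense in which the estimator exploits local temporal variation.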

Finally, we analyse how ambient outdoor smoke PM2.5 infiltrates into the indoor home environment and whether behaviours and circumstances shape this infiltration. Understanding indoor concentrations is critical, as individuals in the United States spend the vast majority of their time indoors. Using data from the American Time Use Survey, we calculate that Americans on average spend >70% of their time indoors at home, with higher shares for lower-income and elderly individuals and overall shares trending up over time (Extended Data Fig. 2). Personal integrated exposure to variation in ambient exposure is then probably substantially mediated by characteristics of home and work environments that are hard to observe33,34,35, and these differences could in turn affect outcomes36,37,38,39. If socio-economic or demographic variables shape indoor environments in ways that affect exposures, as has been hypothesized37, then exposure levels or policy choices that appear equitable on the basis of traditional outdoor measures could obscure large disparities in realized exposures.

We assemble and harmonize hourly data from 1,520 indoor PurpleAir air pollution monitors that individuals have put in their single-family homes across the United States and use nearby outdoor PurpleAir monitors to construct outdoor PM2.5 concentrations at each home (Methods). To estimate infiltration, we use distributed lag or lagged-dependent variable panel regression to estimate the marginal increase in indoor PM2.5 when outdoor PM2.5 increases by one unit (that is, ∂IndoorPM2.5/∂OutdoorPM2.5), controlling flexibly for time of day, day of week and month of sample (Methods). We estimate models that pool all indoor monitors as well as monitor-specific models, and we study how infiltration differs as a function of household and neighbourhood characteristics. Our approach complements recent work using PurpleAir to study infiltration generally40 and during wildfires specifically33, though our estimation approach offers advantages relative to the latter such as robustness to indoor pollution sources and to diurnal patterns in infiltration-relevant behaviours (Methods).
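
The distributed lag version of the infiltration estimate can be sketched as follows. This is a simplified illustration rather than the paper's estimator: it fits one monitor by ordinary least squares, omits the time-of-day, day-of-week and month-of-sample controls, and takes infiltration to be the sum of the lag coefficients. The default of six hourly lags and the function names are assumptions.

```python
def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def infiltration(indoor, outdoor, n_lags=6):
    """Sum of distributed-lag coefficients of indoor on outdoor PM2.5.

    indoor, outdoor: equal-length hourly series (ug/m^3) for one home.
    """
    # Design matrix: intercept plus outdoor PM2.5 at lags 0..n_lags hours.
    rows, ys = [], []
    for t in range(n_lags, len(indoor)):
        rows.append([1.0] + [outdoor[t - k] for k in range(n_lags + 1)])
        ys.append(indoor[t])
    p = n_lags + 2
    # Normal equations (X'X) beta = X'y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(p)]
    beta = solve(xtx, xty)
    return sum(beta[1:])  # cumulative response, excluding the intercept
```

Summing the lag coefficients gives the cumulative change in indoor PM2.5 per unit change in outdoor PM2.5. A constant indoor source is absorbed by the intercept instead of inflating the estimate.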

The timeliness and granularity of passive distributed sensor data need to be weighed against their potential non-representativeness, as the latter can bias population-scale inferences. Our search data and mobility data are probably our most representative, as the vast majority of Americans use the internet regularly and most own and use smartphones. Twitter users are less representative on average, but Twitter-derived sentiment measures have been shown to validate well against population emotional state, and related work shows that the response of sentiment to environmental stress mirrors that measured in representative survey data. The PurpleAir data are the least representative of our datasets, with wealthier and more educated households more likely to own monitors; however, as discussed below, socio-economic and demographic information does not appear strongly predictive of infiltration. See the Methods for a more detailed discussion of sample representativeness.

Results

Increases in wildfire-derived ambient PM2.5 exposure lead to an increase in popularity for air-quality-related search terms, with even small increases above zero exposure appearing salient (Fig. 2a; P < 0.001; linear model effect size, 0.689; 95% confidence interval (CI), 0.503, 0.874). The results are robust to alternate air-quality-related search terms and to using analogous search terms in Spanish (Supplementary Table 1), are not driven by proximity to active wildfires, and are robust to the inclusion of weather controls or alternate fixed effects (Supplementary Tables 2 and 3). Placebo search terms plausibly unrelated to smoke exposure do not respond to exposure, and search for smoke-related terms does not respond to variation in PM2.5 on non-smoke days (Supplementary Table 4). Our results are consistent with interview-based evidence finding that individuals who perceived they were being exposed to smoke often used internet-based sources to confirm their perceptions19, although we cannot easily distinguish whether individuals recognized that they were being exposed to smoke PM2.5 specifically or just to poor air quality from any source (Supplementary Table 3). The salience of ambient exposure at low levels is also somewhat reassuring given recent evidence of health impacts for sensitive populations even at very low levels of ambient exposure18.

Fig. 2: Behavioural responses to wildfire smoke exposure.

a, Searches for ‘air quality’ on Google at the level of US designated market area (DMA) by week, 2016–2020. The search index is normalized such that zero indicates no searches and 100 indicates the maximum number of weekly searches over the period. b, Average sentiment in tweets at the county-day level, 2017–2020. c, Searches for ‘air filter’ on Google at the level of US designated market area by week, 2016–2020. The search index is normalized as in a. d, Percentage of mobile phones estimated to be completely at home on a given day at the US county level, 2019–2020. The black lines are regression point estimates from spline fits conditional on fixed effects, and the shaded areas show bootstrapped 95% CIs. The number of observations in each regression is shown in the upper left corner of each panel. The histograms at the bottom show the log distribution of smoke PM2.5 exposure in each sample.

We find that exposure to ambient smoke PM2.5 makes people unhappier, as measured in an automated sentiment analysis of five years of tweets across the United States. Expressed sentiment in tweets declines roughly linearly above smoke PM2.5 exposures of ~20 μg m−3 (Fig. 2b; linear effect size of a 100 μg m−3 increase, −0.0087; 95% CI, −0.0108, −0.0067; P < 0.001). A very bad smoke day (average smoke PM2.5 concentration of 100 μg m−3) is associated with a 0.03 decline in sentiment in the non-linear model shown in Fig. 2b, which is equivalent to a decline of roughly 0.2 standard deviations of the overall sample. For context, the average difference between tweet sentiment on Wednesdays and Saturdays (respectively, the lowest- and highest-sentiment days of the week in our data) is about 0.007; one day of very bad smoke (100 μg m−3) is thus about four times worse in sentiment terms than replacing an average Saturday with an average Wednesday.
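
The ‘about four times’ comparison follows directly from the two quantities quoted above. The implied sample standard deviation on the right is a back-of-envelope inference from the rounded figures in the text, not a separately reported value.

```latex
\frac{0.03}{0.007} \approx 4.3,
\qquad
\sigma_{\text{sentiment}} \approx \frac{0.03}{0.2} = 0.15
```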

Negative effects of smoke on sentiment could occur through a variety of channels, including from fear or anxiety about proximate fires themselves or about what the fires represent (for example, a changing climate), from unhappiness due to disruption in normal activities (for example, school closure or inability to recreate), or from anticipation or experience of negative health impacts. While we cannot distinguish the latter channels in our data, the effects of smoke on sentiment are not driven by proximity to active wildfire and are robust to temperature and rainfall controls and to alternate fixed effects (Supplementary Table 5). These results are consistent with a broader literature documenting the negative psychological effects of air pollution exposure41.

Exposure to smoke PM2.5 increases search activity related to protective behaviour. Searches for technologies known to help limit exposure, including ‘air filter’, ‘air purifier’, ‘smoke mask’ and ‘purple air’, all increase on days in which smoke exposure is higher (Fig. 2c and Supplementary Table 6; P < 0.001; linear model effect size, 0.453; 95% CI, 0.3, 0.606). Some search queries in Spanish (‘purificador de aire’) respond similarly, although others (‘filtro de aire’) do not (Supplementary Table 6).

Finally, smoke PM2.5 exposure on average causes more people to remain at home, with immediate increases at low levels of exposure that flatten off at high levels (Fig. 2d). A day of smoke exposure above 50 μg m−3 leads to a roughly three-percentage-point increase in the proportion of people fully at home (P < 0.001; linear effect size, 0.023; 95% CI, 0.016, 0.031), which corresponds to about a 10% increase above the mean. Smoke PM2.5 exposure has a limited effect on the proportion of people fully away from their homes at low exposure levels but an increasing effect at higher exposure levels (Extended Data Fig. 3). Both results can be interpreted as protective behaviour: during heavy smoke days, many individuals shelter in their homes, and some leave the area when exposure gets severe. Both results are robust to controls and are not driven by proximity to active fires; the effect on the percentage of people at home is less robust to the addition of more stringent time controls (Supplementary Table 7).

Exposure and response heterogeneity

Individuals are likely to respond to environmental exposures in different ways, either because their personal exposure varies or because, for a given exposure, their knowledge of that exposure or their willingness or ability to respond to it differs. We explore heterogeneous exposures and responses to wildfire smoke as a function of socio-economic status (as measured by locality-specific median household income) and variation in average previous exposure to wildfire or other PM2.5. Previous literature suggests that both could moderate behavioural responses to environmental stress through a variety of mechanisms, including through differential access to information about exposure risk or differential ability, motivation or knowledge of how to take protective action8,20,42.

Consistent with earlier work10, but in stark contrast to strong socio-economic and ethnic/racial gradients in exposure to other key pollutants in the United States43,44, we find that exposure to both average and acute smoke PM2.5 is largely uncorrelated with income in the United States (Extended Data Fig. 4). We also find no differences in salience of smoke exposure between lower- and higher-income counties, with similar responses of search query activity to a day of heavy smoke across income levels (Fig. 3a; P = 0.85 on linear interaction).

Fig. 3: Smoke salience does not differ, but other responses do differ, among more and less wealthy populations.

a, The effect of heavy smoke exposure (50 μg m−3 of smoke PM2.5 on that day) on searches for ‘air quality’ on Google does not differ by income. b, A heavy smoke day has a stronger negative effect on sentiment among wealthy populations. c, Wealthier populations are substantially more likely to search for ‘air filter’ on Google during a heavy smoke day. d, Wealthier populations are more likely to be completely at home during a day of heavy smoke. The black lines show the regression point estimates of the marginal effect of a 50 μg m−3 smoke day on the outcome at different income levels, and the shaded areas show bootstrapped 95% CIs. The slope of each line is reported in the upper left with the corresponding P value based on a two-sided t-test, and the histograms at the bottom show the distribution of county incomes in each sample. The sample in each panel is the same as in the corresponding panel in Fig. 2.

Other behavioural measures show strong income gradients. For sentiment, wealthier counties respond much more negatively to a heavy smoke day than less wealthy counties (Fig. 3b; P = 0.001; effect size on linear interaction, −0.001; 95% CI, −0.002, −0.001). This finding is not driven by average differences in sentiment between more and less wealthy counties, by higher overall variation in sentiment in wealthier versus poorer counties (temporal variation in sentiment is lower in wealthy counties than in less-wealthy counties in our sample) or by differences in average exposure to smoke PM2.5 or other sources of PM2.5 (Supplementary Table 9). These results are consistent with a similar analysis in China, which also showed larger negative sentiment responses to air pollution in higher-income cities45.

Search activity related to protective behaviour is also substantially higher in wealthier counties (Fig. 3c; P < 0.001; effect size on linear interaction, 0.013; 95% CI, 0.006, 0.019) and is not statistically different from zero in roughly the bottom third of the county income distribution. Finally, populations in wealthier counties are also substantially more likely to remain fully at home during a day of heavy wildfire smoke exposure than lower-income populations (Fig. 3d; P < 0.001; effect size on linear interaction, 0.002; 95% CI, 0.001, 0.002). These results are robust to more or less restrictive time controls (Extended Data Fig. 3c). We find no meaningful difference across income groups in the proportion of individuals fully away from their houses during days of heavy smoke exposure (Extended Data Fig. 3d).
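
The ‘linear interaction’ effect sizes reported here can be read through a simple sketch of how a marginal effect varies with income in an interaction model. The functional form below is an assumption for illustration (the full specification is in the Methods), and the function names, coefficient values and income units are hypothetical.

```python
def marginal_effect(beta_smoke, beta_interaction, income, smoke=50.0):
    """Effect of a `smoke` ug/m^3 day on the outcome at a given income.

    Assumes outcome = beta_smoke*smoke + beta_interaction*smoke*income + FE,
    so d(outcome)/d(smoke) = beta_smoke + beta_interaction*income.
    """
    return smoke * (beta_smoke + beta_interaction * income)


def income_where_effect_is_zero(beta_smoke, beta_interaction):
    """Income at which the fitted marginal effect crosses zero."""
    return -beta_smoke / beta_interaction
```

In this form, the interaction coefficient is the change in the smoke effect per unit of income. A positive value means that the response grows with income, and the fitted effect can be indistinguishable from zero at the low end of the income distribution.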

Why do wealthier locations respond differently to smoke exposure? The measured differences do not appear to reflect differences in exposure information or in overall internet activity, given the consistent response of air-quality-related searches across income groups. Rather, the responses are consistent with lower incomes constraining choice sets and behaviours, including less flexibility in working from home, fewer resources with which to consider purchasing protective technology and (regarding the sentiment results) having other more pressing matters to worry about.

We find that behavioural measures are also affected by previous experience with smoke and with other PM2.5 sources. An additional smoke day was less salient in locations with higher previous exposure to smoke PM2.5, and people in locations with higher average PM2.5 exposure prior to our study period showed smaller declines in sentiment during an additional high smoke day and fewer searches related to health-protective behaviour, but were more likely to stay at home when smoke PM2.5 was high (Supplementary Tables 8 and 9). These results are consistent with individuals adapting their behaviour and beliefs on the basis of repeated exposure—for example, through investments in health-protective technologies.

Smoke PM2.5 infiltration into indoor environments

We find that census tracts with PurpleAir monitors tend to be wealthier on average than tracts without monitors (Extended Data Fig. 5), a finding consistent with other analyses46 and with the income-differentiated search activity for ‘purple air’ and related health-protective technologies found above. Nevertheless, the average income of locations owning indoor monitors varies by roughly 10× across locations, enabling an exploration of the role of income and other demographic factors in shaping exposures among a population with identical access to information on their exposures.

Using a pooled model, we estimate that a 1 μg m−3 increase in outdoor PM2.5 is associated with a 0.145 (95% CI, 0.135, 0.153; P < 0.001) μg m−3 increase in indoor PM2.5 over the next six hours. The estimates are robust to alternate regression models and alternate corrections to the monitor data (Extended Data Fig. 6) and are comparable in magnitude to recently published estimates40. Estimated infiltration is substantially lower during periods of high outdoor PM2.5, and responses differ during smoke periods (Fig. 4a). When no smoke is present, at median outdoor PM2.5 concentrations (6 μg m−3), infiltration declines by 0.0281 for each 10 μg m−3 increase in outdoor PM2.5 (95% CI, −0.02925, −0.02810; P < 0.001). However, when smoke is present, infiltration declines by only 0.0209 for each 10 μg m−3 increase in outdoor PM2.5 (95% CI, −0.02141, −0.02043; P < 0.001). Earlier findings of lower infiltration on smoke days33 were probably capturing the effect of overall high PM2.5 rather than the effect of smoke-derived PM2.5 specifically.

Fig. 4: Infiltration rates decline strongly with outdoor PM2.5 concentration during both fire and non-fire periods, but not with income, and they differ greatly across households, resulting in extreme differences in indoor exposure during wildfires.

a, Infiltration rates, measured as the integrated hourly change in indoor PM2.5 per unit increase in outdoor PM2.5, are lower during periods of high outdoor PM2.5 but decline more quickly when PM2.5 comes from other sources. The lines are regression point estimates, and the shaded areas are bootstrapped 95% CIs. b, Infiltration rates are only weakly correlated with census-block median income; the lines, shading and units are as in a. c, The distribution of infiltration rates estimated separately for each household in the sample is wide. d, Residents from the highest and lowest infiltration quartiles in the Bay Area64 are mapped and coloured according to group, showing that the groups are geographically intermixed. e, Daily outdoor PM2.5 concentrations during the unprecedented August–September 2020 wildfire smoke event were highly similar between high- and low-infiltration households. The transparent lines show concentrations at each residence, and the thick lines show the averages within each group. The left panel shows daily mean concentrations, and the right panel shows averages across all days during the event. f, Same as e but for indoor PM2.5 concentrations, showing very large (>100 μg) daily differences during peak outdoor concentrations (left) and 3.5× differences in average indoor exposure between high- and low-infiltration groups (right).

Consistent with our other behavioural measures, declining infiltration at high outdoor PM2.5 levels suggests that salient ambient exposures induce behavioural responses, which could include closing windows or doors and/or using mechanical filtration. However, in contrast with our other behavioural measures, we find only a modest relationship between neighbourhood average income and infiltration, with households in much wealthier census blocks experiencing only slightly lower average infiltration than households in areas with one quarter the average income regardless of whether the PM2.5 was smoke-derived (Fig. 4b; linear interaction effect size, −0.012 μg m−3 indoor PM2.5 per additional 1 μg m−3 outdoor PM2.5 for each US$100,000; 95% CI, −0.030, 0.006; P = 0.180).

To further explore predictors of infiltration, we estimate infiltration separately for each of the 1,520 indoor monitors in our dataset, match each monitor to a wide range of house- and neighbourhood-specific socio-economic, demographic, environmental and housing covariates, and fit flexible machine learning models relating infiltration to these covariates (Methods). Consistent with other work40, we find many-fold differences in household-specific infiltration rates (mean = 0.19, s.d. = 0.16; Fig. 4c), and we confirm using a Bayesian hierarchical model that this variation is largely due to ‘true’ underlying variation between households rather than to sampling noise in household-level estimates (Extended Data Fig. 7a). The estimates are only modestly correlated with traditional indoor/outdoor ratio estimates (Extended Data Fig. 8), perhaps due to the difficulty in accounting for indoor sources of emissions or diurnal behavioural patterns in the traditional indoor/outdoor approach (Supplementary Information).
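
The distinction between ‘true’ variation and sampling noise can be illustrated with a method-of-moments calculation. This is a deliberately simpler stand-in for the Bayesian hierarchical model, not a reproduction of it. It assumes that each household-level estimate comes with a standard error and that estimation errors are independent of the true rates. The function name is hypothetical.

```python
from statistics import mean, pvariance


def true_variance_share(estimates, std_errors):
    """Share of across-household variance not attributable to sampling noise.

    Method-of-moments approximation:
    Var(observed estimates) ~= Var(true rates) + mean(SE^2).
    """
    observed = pvariance(estimates)
    noise = mean([se ** 2 for se in std_errors])
    # Floor at zero: noise can exceed observed variance in small samples.
    return max(observed - noise, 0.0) / observed
```

A share close to one would indicate that the spread in household-level estimates mostly reflects real differences between households, which is the qualitative conclusion reported above.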

While racial/ethnic, socio-economic, environmental and housing variables are associated with infiltration on held-out data, their individual explanatory power is very modest, and our rich set of predictors and flexible models are surprisingly poor predictors of overall variation in infiltration, explaining only ~5% of variation across indoor monitors in our data (Extended Data Fig. 7b,c). This lack of predictive ability of socio-economic factors is also apparent on individual smoke days, where even among relatively socio-economically advantaged households, very similar outdoor PM2.5 concentrations during a given smoke day are associated with widely varying indoor PM2.5 concentrations (Extended Data Fig. 9).

To further investigate the differential influence of behaviour versus housing characteristics (and associated socio-economic factors), we re-estimated infiltration for individual households during periods when windows were likely to be closed and indoor filtration not running (Methods). While average infiltration during these periods was relatively similar to infiltration during all periods (Extended Data Fig. 7d), infiltration varies much more strongly with both income (Extended Data Fig. 7e) and housing age (Extended Data Fig. 7f) under these conditions. Taken together, and consistent with previous smaller-scale work39, our results indicate that the poor explanatory power of socio-economic and housing characteristics is driven not by poor measurement of these characteristics but by the dominant effect of idiosyncratic household-specific behaviours that are not correlated with these characteristics.

Finally, using indoor monitors across the Bay Area and data prior to August 2020, we divide monitors into low (bottom quartile) and high (top quartile) infiltration groups (Fig. 4d) and study outdoor and indoor PM2.5 levels across these groups during the extreme wildfire smoke event that the area experienced in August–September 2020. High- and low-infiltration households experienced nearly identical daily outdoor concentrations during the many-week event (Fig. 4e), but these ambient levels led to starkly different indoor concentrations. On the worst smoke days, daily average indoor concentrations across all high-infiltration homes exceeded 65 μg m−3, and in some houses they exceeded 100 μg m−3, well above the World Health Organization 24-hour PM2.5 exposure guideline of 15 μg m−3. Low-infiltration households were on average able to maintain indoor PM2.5 concentrations near 5 μg m−3. Across the duration of the smoke event, daily mean indoor concentrations were on average 3.5× higher in the highest quartile versus the lowest quartile of infiltration households. Differences were even larger when looking across all Bay Area monitors: households with average outdoor PM2.5 levels within 5 μg m−3 of each other experienced >20× differences in average indoor PM2.5 concentrations during the smoke event (Extended Data Fig. 10).
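
The quartile comparison described above can be sketched as follows, assuming that each household has a pre-event infiltration estimate and a mean indoor PM2.5 concentration over the event. The function name, the input layout and the quartile convention (the default ‘exclusive’ method of Python's `statistics.quantiles`) are assumptions for illustration.

```python
from statistics import mean, quantiles


def quartile_exposure_ratio(infiltration, indoor_event_mean):
    """Ratio of mean indoor PM2.5 in the top vs bottom infiltration quartile.

    infiltration: dict household -> pre-event infiltration estimate
    indoor_event_mean: dict household -> mean indoor PM2.5 during the event
    """
    # Quartile cut points computed from pre-event infiltration only, so
    # that grouping does not depend on behaviour during the event.
    q1, _, q3 = quantiles(list(infiltration.values()), n=4)
    low = [h for h, v in infiltration.items() if v <= q1]
    high = [h for h, v in infiltration.items() if v >= q3]
    high_mean = mean([indoor_event_mean[h] for h in high])
    low_mean = mean([indoor_event_mean[h] for h in low])
    return high_mean / low_mean
```

Defining the groups from data before the event keeps the classification independent of the exposures being compared.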

Discussion

A growing literature documents the large and often disparate impacts of wildfire smoke on a range of health outcomes14,15,16,17,18,21,22. Our results show how non-traditional sensor data can provide policy-relevant insight into why the magnitude and incidence of these impacts might vary. Multiple lines of evidence indicate that awareness of smoke concentrations does not appear to be a primary constraint on individual behaviour in the face of wildfire smoke exposure: even small increases in ambient exposure cause individuals to seek air quality information, become unhappier and stay in their homes. But while awareness appears to be broadly shared, it does not lead to adequate health protection. Even among populations that own indoor monitors and thus have access to accurate, real-time measures of their indoor concentrations, information is not enough to limit dangerous indoor exposures to these pollutants. This suggests that policies targeting information provision about smoke are insufficient, and perhaps not central, to enabling protective behaviour.

Socio-economic status is not correlated with outdoor smoke levels but does appear to mediate behavioural responses to such pollution. Wealthier households in our sample can more easily stay home, are more likely to seek information on protective technology and are more likely to own indoor pollution monitors. Such differential behaviour is consistent with a broader literature that shows how socio-economic status constrains households’ abilities to invest in environmental quality and health protection8,47. Yet, at least in our sample of monitor-owning households, income is only weakly correlated with the infiltration of ambient smoke into indoor environments, and we observe many households in wealthy neighbourhoods experiencing exceedingly high levels of indoor smoke exposure.

Our results suggest that this is probably because present infiltration rates are dominated by actions such as opening windows and doors, not housing materials or quality that might be reflected in prices. Infiltration patterns thus point to the importance of behaviour that remains unobserved, a fact that is both encouraging and troubling. If simple but difficult-to-observe behaviours such as closing windows and doors explain the vast majority of variation in smoke infiltration, then reducing infiltration at the population scale could be much easier in theory than if infiltration was largely determined by income or housing quality, as changing these latter factors requires addressing deeper societal problems of inequality and structural racism. Nevertheless, a key limitation of our infiltration analysis is our reliance on a convenience sample of households who own PurpleAir monitors, who are overwhelmingly Californian and higher-income. Better measurement of indoor air quality and infiltration in lower-income households, and in households around the country, remains a critical research priority.

Current policy approaches to addressing smoke exposure focus on behavioural recommendations to stay at home and close windows and doors24, but our results suggest that these policies alone are difficult to comply with and may still be inadequate: many households’ indoor environments remain highly exposed, and our mobility results suggest that adherence might be difficult for lower-income households. If such behaviours are indeed hard to adopt, then the policy approach of promoting private provision of protection could be biased against disadvantaged groups. This policy approach also stands in stark contrast to the approach of public provision of protection used for other sources of PM2.5, which has sought to reduce emissions of pollutants at their source and which has successfully reduced overall ambient exposure inequalities48. Further understanding the variation and causes of the behaviours that can protect indoor environments will be key to designing policy that can both lower indoor concentrations and not disadvantage certain groups.

Methods

All of the data used in our study are either from public sources where individuals are posting public statements and/or consenting to have their location tracked (Twitter and PurpleAir), or from spatially and temporally aggregated data with no available personally identifying information (search trends data and mobility data).

Estimating ground PM2.5 concentrations from smoke

We develop a generic, tractable method for estimating ground PM2.5 attributable to smoke at the daily level. The method requires a credible estimate of whether there is smoke in the air on a given day, and a daily time series of PM2.5 from which location- and period-specific anomalies can be constructed. In principle, any available (accurate) daily PM2.5 estimates could be used, including recent promising machine-learning-based efforts at generating high-resolution gridded time series of PM2.5 concentrations49,50. However, existing gridded data are not available for recent years, so we instead use station-based daily PM2.5 measures from the network of thousands of EPA stations across the continental United States (Fig. 1a).

To construct our daily measures of smoke PM2.5, we define \({\mathrm{PM}}_{idmy}\) as the PM2.5 concentration recorded by the EPA monitor at location i on day d, month m and year y. From this time series, we construct location- and month-specific anomalies \({\mathrm{PManom}}_{idmy}={\mathrm{PM}}_{idmy}-\overline{{\mathrm{PM}}}_{imy}\), where \(\overline{{\mathrm{PM}}}_{imy}\) is the monthly median PM2.5 on non-smoke days at that location, and where the median is defined over a three-year window centred on the year of interest. We use a three-year moving median to account for the long-term declining trend in PM2.5 across most of the United States driven by non-wildfire causes10. So, for example, a PM2.5 anomaly for the Redwood City, California, EPA station on 10 January 2019 is calculated as the value on 10 January 2019 minus the median PM2.5 value on all January days in 2018, 2019 and 2020 in Redwood City when smoke was not overhead. Our measure of whether smoke was overhead, \({\mathrm{plume}}_{idmy}\), is derived from the NOAA Hazard Mapping System (HMS) satellite estimates of smoke plume boundaries. We define \({\mathrm{plume}}_{idmy}=1\) if there was a smoke plume of any thickness over location i during any time on day d, and zero otherwise. We estimate that having a smoke plume of any thickness overhead increases daily PM2.5 concentrations at EPA reference monitors by an average of 4.0 μg m−3, and the effect becomes stronger as plume density increases from light to medium to heavy (Supplementary Table 10). Finally, from these data we can construct \({\mathrm{SmokePM}}_{idmy}={\mathrm{PManom}}_{idmy}\times {\mathrm{plume}}_{idmy}\). \({\mathrm{SmokePM}}_{idmy}\) will thus equal zero when there is no plume overhead and will equal the anomaly value when there is smoke overhead. Our approach thus provides a continuous measure of smoke exposure intensity. We note that our approach is unaffected by an overhead smoke plume that does not mix down to the surface; in that case, ground PM2.5 anomalies would be zero, and so no smoke PM2.5 would be assigned.
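
The construction described above can be illustrated with a minimal Python sketch. It is not the code used in the study: the record layout (dictionaries with `station`, `date`, `pm25` and `plume` keys) and the function name `smoke_pm` are assumptions made for illustration, and the tiny example dataset is invented.

```python
from collections import defaultdict
from datetime import date
from statistics import median


def smoke_pm(records):
    """Return SmokePM for each record (anomaly x plume indicator).

    records: list of dicts with keys station, date (datetime.date),
    pm25 (float) and plume (0 or 1). Hypothetical layout.
    """
    # Pool PM2.5 on non-smoke days by (station, calendar month, year).
    background = defaultdict(list)
    for r in records:
        if r["plume"] == 0:
            key = (r["station"], r["date"].month, r["date"].year)
            background[key].append(r["pm25"])

    out = []
    for r in records:
        d = r["date"]
        # Three-year window centred on the year of interest: y-1, y, y+1.
        pool = [v
                for y in (d.year - 1, d.year, d.year + 1)
                for v in background.get((r["station"], d.month, y), [])]
        if not pool:
            out.append(None)  # no non-smoke baseline available
            continue
        anomaly = r["pm25"] - median(pool)
        out.append(anomaly * r["plume"])  # zero whenever no plume is overhead
    return out


# Invented example: three non-smoke January days and one smoke day.
example = [
    {"station": "A", "date": date(2018, 1, 5), "pm25": 8.0, "plume": 0},
    {"station": "A", "date": date(2019, 1, 6), "pm25": 10.0, "plume": 0},
    {"station": "A", "date": date(2020, 1, 7), "pm25": 12.0, "plume": 0},
    {"station": "A", "date": date(2019, 1, 10), "pm25": 30.0, "plume": 1},
]
result = smoke_pm(example)
```

In this example the smoke day is compared with the median of the three non-smoke January values (10), so it receives a smoke PM2.5 of 20, while the non-smoke days receive zero regardless of their anomaly.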

Our approach is similar to recent work51 using interpolated station data and plumes to estimate smoke PM2.5. However, given the high spatial variation in smoke exposure and the often large distance between EPA stations, we chose not to interpolate EPA stations. To confirm that just one or a handful of monitoring stations in a given county or metro area can adequately represent temporal variation in smoke exposure in that area, we computed the pairwise correlations between time series of smoke observations in each pair of stations in our data, restricting to stations with at least 1,000 days of data (yielding >85,000 pairwise combinations). We then studied correlation in smoke PM2.5 between stations as a function of distance between stations (Extended Data Fig. 1b). Counties in our sample (our main unit of analysis) have an average width of 55 km, and metro areas (used in the Google Trends data, described below) have an average width of 228 km; these widths represent the upper bound on an individual’s distance from a monitor in our data, and average distances are probably much smaller given that monitors are purposely placed in populated locations. Median correlations in smoke PM2.5 variation are on average r = 0.84 and r = 0.63 at these two distances, respectively, suggesting that data from an individual point in a county/metro area are reasonably representative of variation elsewhere in the county/metro area. We emphasize that our statistical models exploit this location-specific temporal variation in smoke PM2.5, which is unlikely to be affected by spatial bias or unrepresentativeness in average pollution values at some stations52,53. Any remaining non-systematic measurement error due to distance from monitors will attenuate our estimated effects of smoke towards zero54.
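
The pairwise comparison of stations can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the study's code: the station dictionary layout, the function names and the use of great-circle (haversine) distance are all assumptions, and the two-station example is invented.

```python
import math
from itertools import combinations


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km on a sphere of radius 6,371 km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def pearson(x, y):
    """Pearson correlation; NaN if either series has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def pairwise_corr_by_distance(stations, min_days=1000):
    """Return (distance_km, correlation) for each eligible station pair.

    stations: dict id -> {"lat", "lon", "series": {day: smoke_pm}}.
    """
    eligible = [s for s, v in stations.items() if len(v["series"]) >= min_days]
    rows = []
    for a, b in combinations(eligible, 2):
        sa, sb = stations[a]["series"], stations[b]["series"]
        common = sorted(set(sa) & set(sb))  # days observed at both stations
        if len(common) < 3:
            continue
        r = pearson([sa[d] for d in common], [sb[d] for d in common])
        dist = haversine_km(stations[a]["lat"], stations[a]["lon"],
                            stations[b]["lat"], stations[b]["lon"])
        rows.append((dist, r))
    return rows


# Invented example: station B is a linear rescaling of station A.
base = [0, 0, 5, 20, 0, 0, 3, 0, 12, 0]
stations = {
    "A": {"lat": 0.0, "lon": 0.0, "series": dict(enumerate(base))},
    "B": {"lat": 0.0, "lon": 1.0, "series": {d: 2 * v + 1 for d, v in enumerate(base)}},
}
pairs = pairwise_corr_by_distance(stations, min_days=5)
```

Binning the resulting pairs by distance and taking the median correlation in each bin would give a curve of the kind summarized in the text.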

Measuring salience and health-protective behaviour

We measure salience and health-protective behaviour using public search query data from Google Trends. The data are accessed using the R package gtrendsR (version 1.4.8.9000)55 and are provided as location-, term- and period-normalized indices ranging from 0 to 100, where 0 is the lowest search volume for that term in that location during the chosen period, and 100 is the highest search volume. The data are available at the DMA level (referred to as ‘metro’ areas by Google Trends), which are geographic regions encompassing television media markets as defined by Nielsen.

We study searches in both English and Spanish, which together are the primary languages spoken by 92% of US households56. We use weekly data on DMAs, the native spatial resolution of the public Trends data, between January 2016 and December 2020, and analyse data on terms related to smoke exposure (including ‘air quality’, ‘smoke’ and ‘wildfire smoke’).

Measuring sentiment

We measure online sentiment for a county-day using the text of Twitter posts (‘tweets’) created in that county on that day. Specifically, we collect nearly all of the geolocated tweets for the continental United States between December 2016 and February 2021 through the Twitter Streaming API, in accordance with the terms and conditions laid out in Twitter’s Developer Agreement (https://developer.twitter.com/en/developer-terms/agreement). Per the agreement, the authors cannot make individual tweets available publicly. To compute sentiment for each tweet, we apply the VADER sentiment analysis model28, a natural language processing algorithm tuned specifically for estimating sentiment from online language. We take the average of the ‘compound’ scores (ranging between −1 and 1) computed by VADER for all tweets in a county-day as our measurement of sentiment. Our approach builds on the computation of expressed sentiment described in ref. 26. Readers may refer to that article for additional details on the general approach to collecting and processing tweets for use in empirical analysis. On average, the mean sentiment for a county-day is 0.17, computed from 455.4 tweets.

Measuring mobility

We assembled a daily dataset of mobility measures at the county level collected between January 2019 and December 2020, the period over which mobility data were made available to researchers by SafeGraph. These data measure the aggregate activity of anonymized device signals, or ‘pings’, at the census block group level. Signals are collected from smartphones, not all cell phones. We focus on two measures constructed from these anonymized signals: the percentage of individuals completely at home on that day, and the percentage of individuals completely away from home on that day. We construct the ‘completely away from home’ variable by counting the percentage of devices on a given day that were not observed in their respective home location. SafeGraph assigns a home location to each device on the basis of its mobility pattern observed over the previous six weeks. We aggregate these data to the county-day level by taking means weighted by the number of devices in each census block group. The data processing details are discussed further in ref. 57.
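
The device-weighted aggregation to the county-day level can be sketched as below. This is a minimal illustration, not the study's pipeline: the row layout, the function name and the example values are assumptions.

```python
from collections import defaultdict


def county_day_share(rows):
    """Device-weighted mean of a block-group share, by county-day.

    rows: iterable of (county, day, n_devices, share), where share is the
    fraction of devices in the block group that were, say, completely at home.
    """
    weighted_sum = defaultdict(float)
    devices = defaultdict(float)
    for county, day, n_devices, share in rows:
        weighted_sum[(county, day)] += n_devices * share
        devices[(county, day)] += n_devices
    # Skip county-days with no observed devices to avoid dividing by zero.
    return {k: weighted_sum[k] / devices[k] for k in weighted_sum if devices[k] > 0}


# Invented example: two block groups in one county on one day.
shares = county_day_share([
    ("06085", "2020-09-01", 100, 0.4),
    ("06085", "2020-09-01", 300, 0.2),
])
```

Here the county-day value is (100 × 0.4 + 300 × 0.2) / 400 = 0.25, so the larger block group pulls the county mean towards its own share.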

Measuring distance to fire

To distinguish the effects of exposure to wildfire smoke from potentially correlated effects of being near an active wildfire, we develop daily measures of proximity to active wildfires and test whether the effects of smoke we uncover on outcomes might instead be the direct effects of proximity to fire. We compute ‘distance to fire’ as the population-weighted average distance from 10 km grid cell centroids within a county to their nearest NOAA HMS fire point(s) and as the distance from a DMA centroid to the nearest fire cluster. Building on earlier work10, ‘fire clusters’ are constructed by buffering each HMS fire point by 3 km square and taking the union of existing overlapping squares over a given day and the previous three days, and distance to fire cluster is set to 0 if the active fire cluster is inside the DMA on that day. This does not mean that 10 km² are burning, but within that 10 km² there are multiple fire points over a three-day period, representing an active and potentially growing fire. We emphasize that our goal in this analysis is not to test the independent effect of proximity to wildfire on these outcomes, but to understand whether we are actually isolating smoke impacts or conflating them with fire proximity.

Estimating ambient smoke impact

We combine the above behavioural measures with our smoke PM2.5 estimates and analyse their correspondence using panel fixed effects estimators, with the goal of isolating the impact of variation in smoke exposure from other time-invariant and time-varying factors that could be correlated with both smoke exposure and outcomes. Specifically, we estimate econometric models of the form:

$${y}_{isdmy}=f({\mathrm{SmokePM}}_{isdmy})+{Z}_{isdmy}+{\alpha }_{im}+{\eta }_{d}+{\varepsilon }_{isdmy}$$
(1)

where \({y}_{isdmy}\) is the outcome of interest in unit i, state s, day d, month m and year y; \({\mathrm{SmokePM}}_{isdmy}\) is our smoke PM2.5 measure on the same day and location; and \({Z}_{isdmy}\) are additional time-varying controls. Our preferred model includes a location-by-month fixed effect \({\alpha }_{im}\) to account for local seasonality in either outcomes or exposures (for example, one intercept for each of the 12 months in Santa Clara County, California) and a day-of-sample fixed effect \({\eta }_{d}\) (for example, a dummy for 1 January 2016, another for 2 January 2016 and so on) to account for common trends or shocks to outcomes or exposures on a given day. Our date fixed effect implicitly also accounts for any average differences in outcomes between weekends and weekdays. We estimate f() using either linear models or more flexible cubic splines to capture potential nonlinearities. In all analyses using search query data, Twitter data or mobility data, smoke PM2.5 is measured using EPA station data, as described above.

In these models, the effect of smoke exposure on outcome y is estimated by relating, for example, outcomes in Santa Clara County on 30 August 2020 versus 1 September 2020 to differences in smoke exposure on those days, after accounting for any common difference across counties in exposure or outcomes between the two days, and any average differences in smoke exposure or outcomes in August versus September in Santa Clara County. A confounding variable would have to be a local time-trending unobservable correlated with both smoke exposure and the outcome. Possible candidates include weather variables and the presence of an active wildfire nearby, and we additionally control for these variables (Zisdmy in equation (1)) in robustness tests, or split the sample between locations nearby and further from an active wildfire.
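
The identifying variation described above can be made concrete with a small simulation. The sketch below estimates the linear version of equation (1) by repeatedly demeaning the outcome and the exposure within location-by-month cells and within days (alternating projections), and then regressing the residualized outcome on the residualized exposure. This is one standard way to absorb two sets of fixed effects and is an assumption here; the paper does not state which estimator or software was used. All names and the simulated data are hypothetical, and controls and standard errors are omitted.

```python
import random
from collections import defaultdict


def demean_by(values, groups):
    """Subtract the group mean from each value."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, g in zip(values, groups):
        sums[g] += v
        counts[g] += 1
    return [v - sums[g] / counts[g] for v, g in zip(values, groups)]


def absorb(values, group_lists, tol=1e-12, max_iter=1000):
    """Alternately demean by each grouping until the values stop changing."""
    current = list(values)
    for _ in range(max_iter):
        updated = current
        for groups in group_lists:
            updated = demean_by(updated, groups)
        change = max(abs(a - b) for a, b in zip(updated, current))
        current = updated
        if change < tol:
            break
    return current


def fe_slope(y, x, group_lists):
    """OLS slope of y on x after absorbing the listed fixed effects."""
    yt, xt = absorb(y, group_lists), absorb(x, group_lists)
    return sum(a * b for a, b in zip(xt, yt)) / sum(a * a for a in xt)


# Simulated panel: 3 locations x 2 months x 10 days, true smoke effect 0.5.
rng = random.Random(0)
obs = [(loc, month, day) for loc in range(3) for month in (8, 9) for day in range(1, 11)]
loc_month = [(loc, month) for loc, month, _ in obs]   # alpha_im cells
day_id = [(month, day) for _, month, day in obs]      # eta_d cells
alpha = {g: rng.uniform(-5, 5) for g in sorted(set(loc_month))}
eta = {g: rng.uniform(-5, 5) for g in sorted(set(day_id))}
x = [rng.uniform(0, 50) for _ in obs]
y = [0.5 * xi + alpha[lm] + eta[dy] for xi, lm, dy in zip(x, loc_month, day_id)]
beta_hat = fe_slope(y, x, [loc_month, day_id])
```

Because the simulated outcome contains no noise, the slope is recovered essentially exactly even though the location-by-month and day effects are large relative to the smoke effect.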

Another potential threat to identification is the COVID-19 pandemic, which near the end of our sample period had demonstrated effects on mobility58 and sentiment59, and probably enhanced awareness about the importance of air filtration60; 2020 was also a year of severe smoke exposure throughout much of the US West. Because we exploit daily variation in smoke exposure over time at particular locations, and because such variation depends largely on stochastic factors such as exactly where fires ignite and which way the wind is blowing, we believe that daily variation in COVID-19 outcomes or behaviours is unlikely to be spuriously correlated with wildfire smoke exposures. However, to further address this confounding risk, we test robustness to even more stringent time controls, including county-by-month-of-sample fixed effects and state-by-day-of-sample fixed effects; these further account for any state-specific differences or trends in COVID-19 severity and/or policy intervention that happened to coincide with wildfire risk. We note that any changes in our observed behavioural outcomes due to wildfire-specific effects on health outcomes, including wildfire’s potential effects on COVID-19 itself17, are not confounding and would constitute part of the overall ‘effect’ that we wish to understand.

To study whether the effects of smoke on outcomes vary across locations, we interact smoke exposure with time-invariant covariates of interest:

$${y}_{isdmy}={\beta }_{1}{\mathrm{SmokePM}}_{isdmy}+{\beta }_{2}{\mathrm{SmokePM}}_{isdmy} \times {X}_{i}+{Z}_{isdmy}+{\alpha }_{im}+{\eta }_{d}+{\varepsilon }_{isdmy}$$
(2)

where Xi in our analysis includes median household income, average previous exposure to PM2.5 and average smoke PM2.5 exposure, included either individually or jointly. Because our analysis is at the county level, and because some covariates (particularly income) could vary substantially within counties, the heterogeneous treatment effects estimated on county data with equation (2) could understate the true underlying heterogeneity in responses to smoke exposure.
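
As a worked reading of equation (2), differentiating with respect to smoke exposure gives the implied response for a location with covariate value X_i. This adds nothing beyond the linear specification itself:

$$\frac{\partial {y}_{isdmy}}{\partial {\mathrm{SmokePM}}_{isdmy}}={\beta }_{1}+{\beta }_{2}{X}_{i}$$

So β2 is the change in the smoke response per unit increase in X_i, and β1 is the response at X_i = 0 (or at the mean of X_i if the covariate is centred, which the text does not specify). When several covariates are included jointly, β2X_i should be read as a sum of one such term per covariate.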

Our approach does not allow us to estimate whether individuals respond to smoke PM2.5 differently than they do other sources of PM2.5. Unlike for smoke PM2.5, we do not have a research design that can isolate plausibly exogenous variation in other sources of PM2.5. For instance, if traffic is an important daily driver of non-smoke PM2.5 in a given location, and traffic volume is correlated with a booming economy, an analysis of the impact of non-smoke PM2.5 on any of our outcomes would struggle to separate the impact of the PM2.5 itself from the impact of the activity that generated the PM2.5. Even if people were unhappy about high PM2.5 levels and would otherwise stay home, an analysis could easily find that both sentiment and mobility were higher on high-PM2.5 days, as people enjoyed their trips to the office. This confounding is unlikely to be a problem for smoke PM2.5, however, as day-to-day variation in smoke exposure (conditional on our controls) is plausibly random.

Measuring indoor and outdoor household PM2.5 using PurpleAir

To estimate household infiltration of outdoor PM2.5 into indoor environments, we utilize data collected by low-cost PurpleAir monitors. Raw ten-minute observations were downloaded from the PurpleAir servers (available at https://thingspeak.com/) via JSON in accordance with PurpleAir terms and conditions. For all available indoor and outdoor PurpleAir monitors in the contiguous United States, data were downloaded from the earliest available date through the end of 2020 or the last available date, whichever was earlier. Data quality checks were implemented following the procedures utilized in recent studies38,40 to produce hourly indoor and outdoor PM2.5 concentrations. We then followed existing literature and used multiple approaches to estimate PM2.5 concentrations from the cleaned PurpleAir data (Supplementary Information).

Hourly ambient exposures were estimated at each indoor monitor site by first identifying all outdoor monitors within 5 km and then taking the inverse distance weighted average of hourly PM2.5 concentration across the (up to) ten nearest monitors. Monitors with fewer than 720 non-missing hourly indoor and outdoor PM2.5 measurements (that is, 30 days of hourly data) were excluded from the analysis.
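
The ambient-exposure calculation for a single indoor monitor and hour can be sketched as follows. This is an illustrative sketch only: the weighting exponent (inverse distance to the first power), the treatment of a co-located monitor at zero distance, the inclusive 5 km cut-off and the function name are assumptions that the text does not specify.

```python
def idw_ambient(neighbours, radius_km=5.0, max_n=10, power=1.0):
    """Inverse-distance-weighted outdoor PM2.5 for one indoor site and hour.

    neighbours: list of (distance_km, pm25) for outdoor monitors, with
    pm25 set to None when the hourly value is missing.
    """
    # Keep non-missing monitors inside the radius, nearest first, up to max_n.
    usable = sorted((d, v) for d, v in neighbours
                    if v is not None and d <= radius_km)[:max_n]
    if not usable:
        return None
    # A monitor at zero distance would get infinite weight; use it directly.
    colocated = [v for d, v in usable if d == 0]
    if colocated:
        return sum(colocated) / len(colocated)
    weights = [1.0 / d ** power for d, _ in usable]
    return sum(w * v for w, (_, v) in zip(weights, usable)) / sum(weights)


# Invented example: the monitor at 6 km falls outside the 5 km radius.
ambient = idw_ambient([(1.0, 10.0), (2.0, 40.0), (6.0, 1000.0)])
```

With weights 1 and 0.5 for the monitors at 1 km and 2 km, the estimate is (10 + 20) / 1.5 = 20, closer to the nearer monitor's reading.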

Finally, indoor PurpleAir monitors are deployed in many different types of buildings. We used a combination of information from monitor labels and manual checking of geolocations to determine which buildings with indoor PurpleAir monitors were single-family residences. All other types of buildings were removed from the sample. In total, there were 1,520 indoor monitors reporting in our sample of single-family residences.

Estimating infiltration rates

To estimate the average indoor infiltration rate, which we define as the increase in indoor PM2.5 concentration per unit increase in local outdoor PM2.5 concentration (that is, ∂IndoorPM2.5/∂OutdoorPM2.5), we estimate a regression at the monitor-hour level. Namely, for each residence i in hour h on day-of-week d and month-of-sample m, we estimate how indoor PM2.5 varies with contemporaneous and previous hour measurements of outdoor PM2.5:

$${\mathrm{PM}}_{ihdm}^{{\mathrm{in}}}=\mathop{\sum }\limits_{k=0}^{6}{\beta }_{k}{\mathrm{PM}}_{i;h-k;dm}^{{\mathrm{out}}}+{\gamma }_{i}+{\delta }_{h}+{\eta }_{d}+{\theta }_{m}+{\varepsilon }_{ihdm}$$
(3)

To isolate the contribution of outdoor PM2.5 to indoor PM2.5 from other time-varying PM2.5 sources (most notably, indoor-sourced PM2.5), we use fixed effects to flexibly control for time invariant differences across households (γi), monthly trends in PM2.5 over the sample (θm) and household-specific average variation in PM2.5 within the day (δh). Day-of-week fixed effects (ηd) control for differences in patterns across weekdays and between weekdays and weekends.

We include six lags here (outdoor PM2.5 at each of the previous six hours) to account for lingering effects of outdoor concentrations in previous hours on contemporaneous indoor concentrations, although the results are robust to the inclusion of additional lags. From this regression, we derive an estimate for outdoor–indoor infiltration by calculating the cumulative effect of a 1 μg m−3 increase in outdoor concentrations on indoor concentrations:

$${\mathrm{infiltration}}=\mathop{\sum }\limits_{k=0}^{6}{\beta }_{k}$$
(4)
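
To illustrate equations (3) and (4), the sketch below fits a distributed lag regression of indoor on outdoor PM2.5 for a single simulated monitor and sums the lag coefficients. It is a simplified stand-in rather than the study's estimator: the fixed effects in equation (3) are replaced by a single intercept, the regression is solved through the normal equations with a small hand-written solver, and the data and lag coefficients are invented.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def ols(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)


def infiltration_rate(indoor, outdoor, n_lags=6):
    """Sum of coefficients on contemporaneous and lagged outdoor PM2.5."""
    X, y = [], []
    for h in range(n_lags, len(indoor)):
        # Intercept, then outdoor PM2.5 at hours h, h-1, ..., h-n_lags.
        X.append([1.0] + [outdoor[h - k] for k in range(n_lags + 1)])
        y.append(indoor[h])
    coefs = ols(X, y)
    return sum(coefs[1:])  # exclude the intercept


# Simulated monitor with known lag coefficients summing to 0.25.
rng = random.Random(1)
true_betas = [0.10, 0.06, 0.04, 0.02, 0.01, 0.01, 0.01]
outdoor = [rng.uniform(0, 60) for _ in range(500)]
indoor = [3.0 + sum(b * outdoor[h - k] for k, b in enumerate(true_betas)) if h >= 6 else 3.0
          for h in range(500)]
estimate = infiltration_rate(indoor, outdoor)
```

In this noise-free simulation the estimated infiltration rate matches the sum of the true lag coefficients, meaning that a sustained 1-unit rise in outdoor PM2.5 raises indoor PM2.5 by 0.25 units.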

To assess the importance of modelling structure, we re-estimated equation (3) with four different lag structures: a distributed lag model with lags for outdoor PM2.5 only (shown above), a lagged dependent variable model with a lag for indoor PM2.5 only, a model with both indoor and outdoor PM2.5 lags, and finally a model with no lag terms (Supplementary Table 11). Infiltration rate estimates derived from each of the models are highly similar (Extended Data Fig. 6), and models with more than six lags produce indistinguishable estimates of infiltration rates.

To examine heterogeneity in infiltration rates across hourly outdoor pollution levels and by smoke presence, we first estimate a nonlinear version of equation (3). Namely, we model indoor PM2.5 as a fourth-degree polynomial of outdoor PM2.5 (and its lags) and interact it with a dummy variable indicating whether smoke was present. The smoke dummy \({S}_{it}\) is defined as 1 when NOAA HMS reported a smoke plume of any density over the PurpleAir monitor on that day and 0 otherwise, where t indexes day of sample and all hours within a given day are assigned the same value for the smoke dummy:

$$\begin{array}{rcl}{\mathrm{PM}}_{ihdmt}^{{\mathrm{in}}}&=&\left[\mathop{\sum }\limits_{k=0}^{6}{\beta }_{k}{\mathrm{PM}}_{i;h-k;dmt}^{{\mathrm{out}}}+\mathop{\sum }\limits_{k=0}^{6}{\alpha }_{k}{({\mathrm{PM}}_{i;h-k;dmt}^{{\mathrm{out}}})}^{2}+\right.\\ &&\left.\mathop{\sum }\limits_{k=0}^{6}{\nu }_{k}{({\mathrm{PM}}_{i;h-k;dmt}^{{\mathrm{out}}})}^{3}+\mathop{\sum }\limits_{k=0}^{6}{\lambda }_{k}{({\mathrm{PM}}_{i;h-k;dmt}^{{\mathrm{out}}})}^{4}\right]+\\ &&\left[\mathop{\sum }\limits_{k=0}^{6}{\beta }_{k;s}{\mathrm{PM}}_{i;h-k;dmt}^{{\mathrm{out}}}+\mathop{\sum }\limits_{k=0}^{6}{\alpha }_{k;s}{\left({\mathrm{PM}}_{i;h-k;dmt}^{{\mathrm{out}}}\right)}^{2}+\right.\\ &&\left.\mathop{\sum }\limits_{k=0}^{6}{\nu }_{k;s}{\left({\mathrm{PM}}_{i;h-k;dmt}^{{\mathrm{out}}}\right)}^{3}+\mathop{\sum }\limits_{k=0}^{6}{\lambda }_{k;s}{({\mathrm{PM}}_{i;h-k;dmt}^{{\mathrm{out}}})}^{4}\right]\times {S}_{it}+\\ &&{\gamma }_{i}+{\delta }_{h}+{\eta }_{d}+{\theta }_{m}+{\varepsilon }_{ihdmt}\end{array}$$
(5)

To measure the infiltration rate, we then calculate the derivative of indoor PM2.5 with respect to outdoor PM2.5 estimated in equation (5) and use the estimated regression coefficients (β, α, ν, and λ) to evaluate across the 1st–99th percentile of observed hourly outdoor PM2.5 concentrations as well as the indicator for whether or not smoke was present. The responses are plotted in Fig. 4a.
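
As a worked illustration of this derivative, suppose outdoor PM2.5 is held at a common level x in the contemporaneous hour and in each of the six lagged hours. This steady-state evaluation is our reading of the procedure; the text does not spell it out. Differentiating equation (5) term by term then gives, for non-smoke days (S_it = 0):

$${\left.\frac{\partial {\mathrm{PM}}^{{\mathrm{in}}}}{\partial {\mathrm{PM}}^{{\mathrm{out}}}}\right|}_{x,\,S=0}=\mathop{\sum }\limits_{k=0}^{6}\left({\beta }_{k}+2{\alpha }_{k}x+3{\nu }_{k}{x}^{2}+4{\lambda }_{k}{x}^{3}\right)$$

and for smoke days (S_it = 1), where the interaction coefficients are added to their baseline counterparts:

$${\left.\frac{\partial {\mathrm{PM}}^{{\mathrm{in}}}}{\partial {\mathrm{PM}}^{{\mathrm{out}}}}\right|}_{x,\,S=1}=\mathop{\sum }\limits_{k=0}^{6}\left[({\beta }_{k}+{\beta }_{k;s})+2({\alpha }_{k}+{\alpha }_{k;s})x+3({\nu }_{k}+{\nu }_{k;s}){x}^{2}+4({\lambda }_{k}+{\lambda }_{k;s}){x}^{3}\right]$$

With all higher-order coefficients set to zero, both expressions reduce to the linear infiltration rate in equation (4).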

We also estimate infiltration rates as a function of median census tract income and smoke by estimating equation (3) with additional income and income-by-smoke interaction terms:

$$\begin{array}{rcl}{\mathrm{PM}}_{ihdmt}^{{\mathrm{in}}}&=&\left[\mathop{\sum }\limits_{k=0}^{6}{\beta }_{k}{\mathrm{PM}}_{i;h-k;dmt}^{{\mathrm{out}}}\right]+\left[\mathop{\sum }\limits_{k=0}^{6}{\beta }_{k;i}{\mathrm{PM}}_{i;h-k;dmt}^{{\mathrm{out}}}\right]\times {\mathrm{income}}_{i}+\\ &&\left[\mathop{\sum }\limits_{k=0}^{6}{\beta }_{k;i;s}{\mathrm{PM}}_{i;h-k;dmt}^{{\mathrm{out}}}\right]\times {\mathrm{income}}_{i}\times {S}_{it}+{\gamma }_{i}+{\delta }_{h}+{\eta }_{d}+{\theta }_{m}+{\varepsilon }_{ihdmt}\end{array}$$
(6)

The median income data come from the American Community Survey. Each indoor monitor was matched to a census tract, and median income was pulled for the most recent available year and updated to 2020 US dollars. We then similarly calculated the derivative of indoor PM2.5 with respect to outdoor PM2.5 and evaluated across the 1st–99th percentile of observed PM2.5 concentrations. The responses are plotted separately for smoke and non-smoke periods in Fig. 4b.

Finally, for each indoor monitor, we estimated a separate distributed lag model analogous to the pooled model in equation (3):

$${\mathrm{PM}}_{ihdm}^{{\mathrm{in}}}=\mathop{\sum }\limits_{k=0}^{6}{\beta }_{ik}{\mathrm{PM}}_{i;h-k;dm}^{{\mathrm{out}}}+{\delta }_{h}+{\eta }_{d}+{\theta }_{m}+{\varepsilon }_{ihdm}$$
(7)

where PM2.5 at indoor monitor i in hour h on day-of-week d and month-of-sample m is modelled as a function of outdoor PM2.5 in that location in the contemporaneous period and for each of the previous six hours. Our estimate of the overall infiltration rate for each monitor, which we denote βi, is then the sum of coefficients over time from the regression for that monitor (that is, \({\beta }_{i}=\mathop{\sum }\nolimits_{k = 0}^{6}{\beta }_{ik}\)).

Understanding variation in household infiltration rates

Monitor-specific estimates suggest large variation in infiltration across households (Fig. 4c), consistent with earlier work40. However, since monitor-specific infiltration values are themselves estimates from data, the observed variation across monitors could reflect ‘true’ underlying heterogeneity in infiltration or could simply reflect sampling variation (or some combination of the two).

To distinguish sampling variation from underlying heterogeneity, we estimate a Bayesian hierarchical model61,62 that models monitor-specific infiltration estimates as being distributed normally about true monitor-specific infiltration values with estimated monitor-specific sampling variance, \({\hat{\beta }}_{i} \sim {\mathrm{N}}({\beta }_{i},{\widehat{{\mathrm{s.e.}}}}_{i}^{2})\), and true monitor-specific infiltration values as drawn from an underlying normal distribution with unknown mean and variance, \({\beta }_{i} \sim {\mathrm{N}}(\beta ,{\sigma }^{2})\).
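
The logic of this two-level normal model can be illustrated with a simple empirical-Bayes approximation. This is a swapped-in technique, not the full Bayesian estimation used in the study: the between-monitor variance is estimated by a method-of-moments formula instead of being given a prior, and the function name and example values are invented.

```python
from statistics import mean, variance


def shrink(estimates, std_errors):
    """Empirical-Bayes shrinkage for a normal-normal hierarchical model.

    Returns (mu, sigma2, posterior_means), where sigma2 is the estimated
    variance of the true monitor-specific infiltration rates.
    """
    s2 = [s * s for s in std_errors]
    # Observed spread = true heterogeneity + average sampling variance.
    sigma2 = max(0.0, variance(estimates) - mean(s2))
    # Precision-weighted estimate of the population mean.
    # (Assumes sigma2 + s2 > 0 for every monitor.)
    precisions = [1.0 / (sigma2 + v) for v in s2]
    mu = sum(p * b for p, b in zip(precisions, estimates)) / sum(precisions)
    # Each posterior mean is a weighted average of the raw estimate and mu;
    # noisier estimates (larger s2) are pulled further towards mu.
    posterior = [(sigma2 * b + v * mu) / (sigma2 + v) for b, v in zip(estimates, s2)]
    return mu, sigma2, posterior


# Invented example: three precise estimates and one noisy outlier.
raw = [0.1, 0.2, 0.3, 0.9]
ses = [0.05, 0.05, 0.05, 0.4]
mu, sigma2, post = shrink(raw, ses)
```

In this example the noisy estimate of 0.9 is pulled substantially towards the population mean, while the three precisely estimated rates barely move. A positive estimated sigma2 indicates heterogeneity beyond what sampling variation alone would produce.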

We then train flexible machine-learning-based models to predict monitor-specific infiltration rates from matched covariates (Supplementary Information). We divide our sample into a 75% training dataset and a 25% held-out test dataset, splitting train and test within 13 disjoint geographic regions covering the contiguous United States to ensure a geographically balanced split. We train random forest and gradient boosted trees models with manually tuned forest and boosting hyperparameters, respectively, and tree parameters tuned using random search with threefold cross-validation repeated five times. We conduct tuning and training for each method of matching monitors and CoreLogic houses for robustness.
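
The geographically balanced split can be sketched as follows. This is a minimal illustration, assuming each monitor has already been assigned to one of the disjoint regions; the function name, the seed and the example data are hypothetical, and model training and tuning are not shown.

```python
import random
from collections import defaultdict


def split_within_regions(monitor_regions, train_frac=0.75, seed=0):
    """Split monitors into train and test sets separately within each region.

    monitor_regions: dict monitor_id -> region label.
    """
    rng = random.Random(seed)
    by_region = defaultdict(list)
    for monitor, region in sorted(monitor_regions.items()):
        by_region[region].append(monitor)
    train, test = [], []
    for region in sorted(by_region):
        ids = by_region[region]
        rng.shuffle(ids)
        cut = round(train_frac * len(ids))
        train += ids[:cut]   # first 75% of the shuffled region
        test += ids[cut:]    # remaining 25% held out
    return train, test


# Invented example: 13 regions with 8 monitors each.
monitors = {f"m{r}_{i}": r for r in range(13) for i in range(8)}
train_ids, test_ids = split_within_regions(monitors)
```

Splitting within regions keeps every region represented in both sets in roughly the same proportion, so held-out performance is not driven by a region that happens to be missing from training.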

We report performance statistics (\({R}^{2}\)) on held-out test data and compute the marginal effect of each predictor by evaluating the predicted effect in the test data of moving from the 5th percentile to the 95th percentile of the predictor, with all other variables fixed at their mean values. We repeat this evaluation for both random forest and gradient boosted trees models, and for all four ways of spatially matching to housing characteristics.
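
The marginal-effect calculation can be sketched generically for any fitted model that exposes a prediction function. In the sketch below, the linear `predict` function is a stand-in for a trained random forest or boosted-trees model, the percentile rule (linear interpolation) is an assumption, and the data are invented.

```python
def percentile(values, q):
    """Percentile by linear interpolation between order statistics, q in [0, 100]."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def marginal_effect(predict, rows, j):
    """Change in prediction when predictor j moves from its 5th to 95th percentile.

    All other predictors are held at their means over `rows` (the test data).
    """
    k = len(rows[0])
    base = [sum(r[c] for r in rows) / len(rows) for c in range(k)]
    column = [r[j] for r in rows]
    low, high = list(base), list(base)
    low[j] = percentile(column, 5)
    high[j] = percentile(column, 95)
    return predict(high) - predict(low)


# Invented example: predictor 0 runs from 0 to 100; the stand-in model is linear.
rows = [[i, (i * 7) % 11] for i in range(101)]
predict = lambda r: 2 * r[0] + 3 * r[1]
effect = marginal_effect(predict, rows, 0)
```

For the stand-in model the effect is 2 × (95 − 5) = 180, because the contribution of the second predictor, held at its mean, cancels out.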

Understanding the representativeness of the study samples

See the Supplementary Information for a discussion of sample representativeness across our multiple datasets.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.