Introduction

In January 2020, a new coronavirus disease, COVID-19 (caused by SARS-CoV-2), was first reported in the United States. Due to the contagious nature of the virus, a country-wide stay-at-home order, issued by the government, was put in place on March 16, 2020. Several vaccines have since been produced by pharmaceutical companies. However, despite the initial distribution of vaccines, the primary method of fighting the virus has been social distancing. More research is needed to determine whether the vaccines prevent asymptomatic infections, whether vaccinated individuals can spread the disease, and how long immunity from the vaccines will last. Until these questions are answered and enough people have been vaccinated to establish herd immunity, health officials recommend social distancing practices to slow the spread of COVID-19 (Mallapaty, 2021; Mayo Clinic Health System, 2021).

Social distancing is the practice of community members avoiding close contact with one another in order to mitigate transmission of the virus. Social distancing guidelines include staying at least 6 feet away from another person, refraining from gathering in groups, and avoiding visits to non-essential establishments (Centers for Disease Control and Prevention, 2020b). In this study, social distancing adherence (SoDA) is the rate at which a community follows the social distancing guidelines. Identifying risky populations, defined as populations with lower SoDA scores, and prioritizing them can help officials make quicker and more effective decisions on resource distribution and on social distancing guidelines and regulations.

Other studies have found that a relationship exists between political and socioeconomic features and SoDA (Allcott et al., 2020; Kavanagh et al., 2020; Painter and Qiu, 2020). Republican-leaning counties were shown to be less adherent to social distancing practices than Democratic counties (Allcott et al., 2020; Painter and Qiu, 2020). Counties with lower per capita income and a high proportion of racial minorities were also found to be associated with lower SoDA (Dasgupta et al., 2020; Kavanagh et al., 2020). All of these studies used mobile phone data to estimate SoDA but did not include the distance between the phones as a measure of human encounters. Other reports have addressed SoDA and the risk of infection through social gatherings using social network-based and probabilistic event-based models, but these studies did not factor in socio-economic features to predict SoDA (Block et al., 2020; Chande et al., 2020).

In the case of additional waves of virus infections and new virus variants, it is important to be able to predict which counties will need more respirators, ventilators, drugs and basic personal protective equipment (PPE), like masks, gloves and gowns. Knowing which locations will need to implement more drastic social distancing regulations is vital to preventing another surge in medical complications and deaths. Lastly, if we know which communities are struggling most to adhere to social distancing, policy makers can make guidelines which make social distancing easier and more accessible for those who need it most.

The aim of our analysis was to develop a prediction tool with improved accuracy to guide future health policy planning. With this prediction tool, policy can be made to mitigate the stress put on the health care infrastructure, control the spread of the virus, and manage economic burden.

Methods

Model overview

We developed a multivariable bagging regression algorithm to predict the SoDA score of each United States county using 45 predictor features. This bootstrapping technique was used to improve accuracy on the current data, increase robustness on unseen data, and reduce overfitting. In each sample of the bagging algorithm, a regression model was trained on an 80%/20% training/test split. Graphs of the counties included in the training set and the test set can be found in the supplementary materials (Supplementary Fig. 1a and b, respectively). The per-sample regression models were then aggregated to form the SoDA model. To rank the degree to which each feature independently correlates with SoDA, we fitted a univariable linear regression between each feature and SoDA and calculated the beta coefficient for every regression. Using this result, we created a subsequent model from the 25 most substantial features in order to determine whether reducing the number of features affected model performance. A model using only features related to the COVID-19 pandemic was also created to determine the effect of these features on SoDA prediction. For this version of the model, we used only the cumulative COVID-19 case and death tolls for each county, the days since the state issued a Stay-At-Home order (if applicable), and the days since the first COVID-19 case and death in each county.
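The bagging setup described above can be sketched with scikit-learn as follows. This is a minimal illustration, not the authors' code: the data are synthetic, and the base learner (ordinary linear regression), the number of estimators, the random seeds, and the use of a single 80%/20% split are all assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the county table: 500 hypothetical counties, 45 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 45))
true_weights = rng.normal(size=45)
y = X @ true_weights + rng.normal(scale=0.5, size=500)  # hypothetical SoDA-like target

# 80%/20% training/test split, as described in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Bagging: each base regressor is fit on a bootstrap resample of the training
# rows, and the ensemble prediction is the average of the base predictions.
# The base learner and n_estimators are assumptions for illustration.
model = BaggingRegressor(LinearRegression(), n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Coefficient of determination on the held-out 20%.
test_r2 = model.score(X_test, y_test)
print(f"test R^2 = {test_r2:.3f}")
```

Averaging many models fit on resampled rows is what reduces variance, which is the usual motivation for bagging when overfitting is a concern.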

Unacast social distancing data

The data used to estimate the degree to which counties were adhering to social distancing were taken from mobile phone movement and location data provided by Unacast (Unacast, 2020). Unacast transformed their mobile phone data into a grading system for SoDA based on three different metrics. The three metrics that comprise the Unacast social distancing score were the percent difference in average distance traveled compared to the pre-COVID-19 period (Metric 1), the percent difference in visitation to non-essential places of interest (restaurants, retail centers, etc.) compared to the pre-COVID-19 period (Metric 2), and the rate of human encounters per square kilometer compared to the national average (Metric 3). Pre-COVID-19 was defined as the 4 weeks prior to March 8 (February 9, 2020 to March 8, 2020). Devices were ascribed to counties based on where the device was for the longest period of time on a specific day, which accounts both for the difference between work and home and for people who moved elsewhere to live during the pandemic. Each metric was converted into a score from 1 (lowest SoDA) to 5 (highest SoDA). More detail on the scoring system can be found in the Unacast US SDS Methodology document listed in the “Resources” section.

Metric 1 was measured by averaging the distance traveled across all devices per county. The metric was calculated for every day as a percent difference from the average distance traveled on the same day in the same county pre-COVID-19. The metric was shown to be strongly correlated with confirmed cases. Metric 2 was calculated for every day as a percent difference in visitation of non-essential places of interest from the baseline visitation in the same county on the same day pre-COVID-19. If a device was shown to be in a non-essential place of interest, it was counted as visiting. A complete list of the non-essential places of interest can be found in the Unacast US SDS Methodology document. Metric 3 was calculated as a percentage difference in the number of close encounters between two devices per square kilometer from the national pre-COVID-19 average. A close encounter of devices was considered to be a spatial distance of 50 m or less and a temporal distance of 60 min or less. Only land area was considered when normalizing the number of encounters by area. Metric 3 was included to account for population density. The final social distancing score for a county on a particular day post-COVID-19 was calculated by taking the average of the three metrics. Counties without sufficient mobile phone data were excluded from the analysis. SoDA scores were derived by averaging the overall social distancing grade per county from March 16, 2020 (first day of the national stay-at-home order) to April 24, 2020 (first day a state relaxed the social distancing guidelines). SoDA scores were calculated for 3054 United States counties (Fig. 1).
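The arithmetic in this section (a percent difference from a pre-COVID-19 baseline, a daily score taken as the mean of the three metric scores, and a county SoDA score taken as the mean of the daily scores) can be sketched as below. The function names and example numbers are hypothetical, and Unacast's mapping from percent differences to 1-to-5 scores is not reproduced here because its thresholds are not given in this text.

```python
from statistics import mean


def percent_difference(value, baseline):
    """Percent change of `value` relative to a pre-COVID-19 `baseline`."""
    return 100.0 * (value - baseline) / baseline


def daily_score(metric_scores):
    """Daily social distancing score: mean of the three metric scores (each 1 to 5)."""
    if len(metric_scores) != 3:
        raise ValueError("expected one score per metric (3 total)")
    return mean(metric_scores)


def soda_score(daily_metric_scores):
    """County SoDA score: mean of the daily scores over the study window."""
    return mean(daily_score(day) for day in daily_metric_scores)


# Hypothetical county observed on three days; each tuple holds the scores
# for Metric 1, Metric 2, and Metric 3 on that day.
example_days = [(4, 3, 5), (3, 3, 3), (5, 4, 3)]

# Hypothetical example: average distance traveled fell from 40 km to 30 km.
print(percent_difference(30.0, 40.0))
print(soda_score(example_days))
```

In the study itself the averaging window runs from March 16, 2020 to April 24, 2020 rather than over three days.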

Fig. 1: Social Distancing Adherence (SoDA) heat map by United States county.

Authors’ analysis of data from the Unacast Social Distancing Dataset from March 16, 2020 (first day of the national stay-at-home order) to April 24, 2020 (first day a state relaxed the social distancing guidelines). States that are grey did not have a sufficient amount of cellphone data to be included.

Socioeconomic data collection

Data on obesity rates, diabetes rates, and COVID-19 cases and deaths were collected from the CDC (Centers for Disease Control and Prevention, 2020a). County-level 2016 presidential election voting data were obtained from the MIT Election Data and Science Lab (MIT Election Data and Science Lab, 2018). Data on days since the state Stay-At-Home order were scraped from CNN news reports (Rose, 2020). All other predictor features used in the model were collected from the American Community Survey (ACS, 5-year averages from 2014 to 2018) (United States Census Bureau, 2020). For all the predictor features, we used the most recent data available as inputs into the model.

Statistical analysis

SoDA scores were expressed as means of daily SoDA scores from March 16, 2020 to April 24, 2020. Univariable regression analyses were done to assess the association between the features in Table 1 and SoDA scores and to obtain the β coefficient for each feature. The regression analyses were two-tailed with an alpha level of 0.05. A p-value of 0.05 or below was considered statistically significant, and the n value of the regressions was the number of counties (3054). Accuracy and goodness of fit of the model were determined by mean squared error and the coefficient of determination, respectively. Algorithms were implemented in Python using the scikit-learn library. The analysis code and data repository can be found at the GitHub link in the “Resources” section.
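A univariable regression of this kind can be sketched with SciPy, as shown below. The sketch assumes that the reported β coefficients are standardized (computed on z-scored variables), which the text does not state explicitly. The data are synthetic and the variable names are hypothetical.

```python
import numpy as np
from scipy import stats


def standardized_beta(feature, target):
    """Regress target on one feature after z-scoring both.

    Returns (beta, p_value); linregress reports a two-sided p-value by default.
    """
    x = (feature - feature.mean()) / feature.std()
    y = (target - target.mean()) / target.std()
    fit = stats.linregress(x, y)
    return fit.slope, fit.pvalue


# Synthetic example with the same n as the study (3054 counties).
rng = np.random.default_rng(1)
n_counties = 3054
soda = rng.normal(size=n_counties)
related = -0.3 * soda + rng.normal(size=n_counties)  # built to correlate negatively
unrelated = rng.normal(size=n_counties)              # built with no relationship

ALPHA = 0.05
for name, feature in {"related": related, "unrelated": unrelated}.items():
    beta, p = standardized_beta(feature, soda)
    print(f"{name}: beta={beta:+.3f}, p={p:.3g}, significant={p <= ALPHA}")
```

With a single predictor, the standardized slope equals the Pearson correlation coefficient, so it is bounded between -1 and 1.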

Table 1 Social distancing adherence (SoDA) model features.

Results

Feature correlations

Owner-occupied housing unit rate was the feature most strongly negatively correlated with SoDA (β = −0.322, p < 0.00001), and persons that work from home was the feature most strongly positively correlated (β = 0.259, p < 0.00001) (Table 1). Features related to age had significant correlations with SoDA: percentage of the county population 65 and over (β = 0.221, p < 0.00001) and median county age (β = 0.204, p < 0.00001) were among the top 25 most substantial features, and both were positively associated with SoDA scores. Both days since the first COVID-19 case in a county (β = −0.276, p < 0.00001) and days since the first COVID-19-related death (β = −0.202, p < 0.00001) were negatively associated with SoDA scores. Several features related to economic status and commuting habits, such as persons owning a house (β = 0.185, p < 0.01), unemployment rate (β = −0.067, p < 0.00001), per capita income (β = 0.0279, p = 0.0279), persons owning a vehicle to commute (β = −0.272, p < 0.00001), and mean travel time to work (β = −0.124, p < 0.00001), also correlated strongly with county social distancing. Additionally, the number of votes for the Republican presidential candidate in 2016 (β = −0.137, p < 0.00001) and the percentage of Black/African American populations in a county (β = −0.134, p < 0.00001) were both significantly negatively correlated with the degree to which a county adhered to social distancing guidelines. Beta coefficients and p-values of features used in the top 25 features model are shown in Table 1. The features that did not significantly correlate with SoDA scores were cumulative COVID-19 deaths (β = 0.0124, p = 0.328) and cases (β = 0.00756, p = 0.552), persons with a bachelor’s degree or higher (β = −0.00208, p = 0.87), persons without health insurance (β = −0.0239, p = 0.06), and percentage of Chinese/Hispanic populations in a county (β = −0.00292, p = 0.818 and β = −0.00252, p = 0.842) (Table 1).

Model results

Using our base SoDA model, COVID-19-related features model, and top 25 features model, we ran an analysis to predict the SoDA scores of 3054 United States counties from March 16, 2020 to April 24, 2020. The results of these models are shown as heat maps in Figs. 2–4.

Fig. 2: Base model results heat map by United States county.

States that are grey did not have a sufficient amount of cellphone data to be included. Authors’ analysis of authors’ model based on data from the Unacast Social Distancing Dataset from March 16, 2020 (first day of the national stay-at-home order) to April 24, 2020 (first day a state relaxed the social distancing guidelines), CDC, MIT Election Data and Science Lab, CNN news reports, and the American Community Survey (ACS, 5-year averages from 2014 to 2018).

Fig. 3: COVID-19-related features model results heat map by United States county.

Authors’ analysis of authors’ model based on data from the Unacast Social Distancing Dataset from March 16, 2020 (first day of the national stay-at-home order) to April 24, 2020 (first day a state relaxed the social distancing guidelines), CDC, and CNN news reports. States that are grey did not have a sufficient amount of cellphone data to be included.

Fig. 4: Top 25 most substantial features model results heat map by United States county.

Authors’ analysis of authors’ model based on data from the Unacast Social Distancing Dataset from March 16, 2020 (first day of the national stay-at-home order) to April 24, 2020 (first day a state relaxed the social distancing guidelines), CDC, and CNN news reports. States that are grey did not have a sufficient amount of cellphone data to be included.

Base model

We found that the base SoDA model predicted the SoDA scores of the counties with 91.6% accuracy (Fig. 2). The model produced a training accuracy of 91.4% (Supplementary Fig. 2a), a test accuracy of 92.3% (Supplementary Fig. 2b), and a coefficient of determination of 0.830 on this data.
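The Statistical analysis section names mean squared error and the coefficient of determination as the measures of accuracy and goodness of fit. As a small worked example, both can be computed with scikit-learn as below. The numbers are made up for illustration, and the conversion from mean squared error to a percentage accuracy is not specified in this text, so it is not attempted here.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical observed and predicted SoDA scores for four counties.
y_true = np.array([3.0, 2.5, 4.0, 3.5])
y_pred = np.array([2.8, 2.7, 3.9, 3.6])

# Mean squared error: average of the squared prediction errors.
mse = mean_squared_error(y_true, y_pred)

# Coefficient of determination: 1 - (residual sum of squares / total sum of squares).
r2 = r2_score(y_true, y_pred)

print(f"MSE = {mse:.3f}, R^2 = {r2:.2f}")
```

Here the squared errors sum to 0.10 over four counties and the total sum of squares around the mean (3.25) is 1.25, giving a mean squared error of 0.025 and a coefficient of determination of 0.92.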

COVID-19-related features model

The COVID-19-related features model scored an accuracy of 64% in predicting SoDA scores (Fig. 3). This version of the model scored an accuracy of 64.1% on the training set (Supplementary Fig. 3a) and 63.9% on the test set (Supplementary Fig. 3b), and had a coefficient of determination of 0.274.

Top 25 features model

The model using the top 25 most substantial features, determined by beta coefficient and p-value, predicted the county SoDA score with an accuracy of 89.0% (Fig. 4). The top 25 features model had a training set accuracy of 88.7% (Supplementary Fig. 4a), a coefficient of determination of 0.777, and a test set accuracy of 89.9% (Supplementary Fig. 4b).
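The selection step can be sketched as below. The text says only that the top features were determined by beta coefficient and p-value, so the exact rule used here (keep features with p at or below 0.05, then rank by absolute β) is an assumption, and the function name and dictionary keys are hypothetical labels. The (β, p) pairs are those quoted in the Feature correlations section, with p < 0.00001 entered as the bound 1e-5.

```python
def top_k_features(results, k=25, alpha=0.05):
    """Rank significant features by |beta| and return the first k names.

    `results` maps a feature name to a (beta, p_value) pair.
    """
    significant = [
        (name, beta) for name, (beta, p_value) in results.items() if p_value <= alpha
    ]
    ranked = sorted(significant, key=lambda item: abs(item[1]), reverse=True)
    return [name for name, _ in ranked[:k]]


# Values quoted from the Feature correlations section; keys are hypothetical labels.
reported = {
    "owner_occupied_housing_rate": (-0.322, 1e-5),
    "work_from_home": (0.259, 1e-5),
    "days_since_first_case": (-0.276, 1e-5),
    "vehicle_commute": (-0.272, 1e-5),
    "cumulative_covid_deaths": (0.0124, 0.328),
}

top3 = top_k_features(reported, k=3)
print(top3)
```

Ranking by absolute value keeps strongly negative features, such as the owner-occupied housing rate, alongside strongly positive ones.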

Discussion

Our results show that economic features impacted a county’s adherence to social distancing. Data on persons using vehicles to commute and on the owner-occupied housing rate indicate that those who live in more suburban areas, who are more likely to use cars to get around or to get to work, have lower adherence. Data on families in poverty indicate that those in poverty have a harder time affording the luxury of social distancing, as those with lower incomes feel they need to work in order to survive (Dasgupta et al., 2020). Conversely, per capita income correlates positively with SoDA, suggesting that those with higher incomes are better able to afford to social distance.

Health features had varied correlations with adherence, particularly among populations who have greater risk of severe illness from COVID-19. Although the diabetic and obese populations are at-risk populations, counties with higher obesity and diabetes rates had lower adherence. Because of the lower adherence, COVID-19-related hospitalizations and deaths for people with obesity and diabetes could increase dramatically in these counties. Conversely, we found that, in terms of age, risk of severe illness trended positively with adherence. Communities with more persons above 65 years old, who are members of the COVID-19 at-risk population, and communities with a higher median age both had higher adherence. Counties that had a higher rate of individuals aged 18 or younger had lower SoDA, most likely because these individuals have stronger immune systems than people above the age of 65 and felt less of a need for social distancing. The cumulative COVID-19 cases and deaths had a negligible impact on predicting SoDA and, furthermore, the model using only COVID-19-related features had lower accuracy than the model with all 45 demographic features. These findings highlight the importance of socioeconomic features in the decision to adhere to social distancing guidelines.

We also found that the proportion of Black/African Americans in a county correlated negatively with SoDA. These findings were consistent with previous studies (Kavanagh et al., 2020; Yancy, 2020). Other studies have shown that nearly half of COVID-19 cases and 60% of COVID-19 deaths occur among African Americans (Laurencin and McClinton, 2020; Mahajan and Larkins-Pettigrew, 2020). Furthermore, the percentage of African Americans living in a county and the percentage of confirmed COVID-19 cases in the county were positively correlated (Laurencin and McClinton, 2020; Mahajan and Larkins-Pettigrew, 2020; Yancy, 2020). These findings suggest overcrowding in these communities, which makes maintaining social distancing guidelines cumbersome and very difficult. Other explanations could be that these communities cannot afford to social distance or that the myth of Black immunity was detrimental to SoDA.

Other notable features with insightful correlations to adherence included political affiliation in the 2016 presidential election, persons who worked from home prior to the pandemic, and the number of days that had passed since the county’s first COVID-19 case and/or death. We found that counties that had more votes for the Republican candidate in 2016 were less likely to follow social distancing guidelines, a result that supports previous findings (Allcott et al., 2020; Kavanagh et al., 2020; Painter and Qiu, 2020). The proportion of the population that worked from home prior to the pandemic had the strongest positive correlation with adherence. A possible explanation for this result is that people who were already working from home had to make fewer adjustments to their daily routine to adhere to social distancing guidelines than people who had to commute to work, which made social distancing more accessible to them. We found that the more days that had passed since the first COVID-19 case and death, the lower the adherence. A possible explanation for these findings is that the counties with earlier first case/death dates passed their peak case/death rate day earlier, thus diminishing the apparent risk of COVID-19 infection (Qureshi et al., 2020).

In order to best prepare for additional waves and new virus strains, government officials need tools to accurately predict the spread of the virus. Our prediction model using socio-economic data, such as demographic, economic and health data, in addition to COVID-19 death and case tolls can provide accurate predictions of SoDA with a high level of confidence, thus predicting future cases. Additionally, the proposed model can be improved to predict daily social distancing behavior and adherence. In cases where, for example, social distancing occurs less on weekends, states and counties can use this data to formulate and place stricter guidelines on the weekends to avoid setbacks. This prediction model can also be used to strengthen current social network models and/or website risk assessment tools by using more granular socio-economic data (Block et al., 2020; Chande et al., 2020).

This model has some limitations. Our data does not account for whether people are wearing masks or gloves when they are near each other. However, recent studies have shown that wearing a mask, along with handwashing, is associated with adherence to social distancing protocols (Doung-ngern et al., 2020; Marchiori, 2020). Additionally, the data does not include information about people who do not use mobile phones. Aggregating the SoDA scores over the course of the quarantine may have produced different results than observing the scores on a daily basis. A future model could explore the features that correlate with daily SoDA. Lastly, the Census data might not reflect current demographics, as the most recent estimates are from 2018.

In summary, our analysis found that features related to economic status, political affiliation, age, race, diabetes, obesity, and date of first COVID-19 case/death correlated strongly with SoDA of a county. We demonstrated that these demographic features were capable of creating an accurate model for predicting county adherence. This prediction model can be a tool to aid health policy planning in the United States in the case of additional waves of COVID-19 or in the event of another pandemic. Future research is necessary to gain more insight into the underlying reasons for the correlations between the features of the model and SoDA.

Resources

GitHub Repository: https://github.com/Ingrammyles8/SoDA_prediction_model

Unacast US SDS Methodology: https://github.com/Ingrammyles8/SoDA_prediction_model/blob/master/methodology/US_SDS_Methodology_Unacast.pdf