Introduction

The COVID-19 pandemic has caused great disruption. Over 43 million people have confirmed diagnoses of the disease, and over 1 million people have died from it1. It has also had substantial impacts on daily lives and economic activities2,3. Many studies have focused on measuring who are affected the most by COVID-194,5, or which therapies are appropriate at each stage of the disease6,7,8. However, it is also crucial to understand how the spread of COVID-19 depends on preventive measures such as social distancing and how the reopening may affect the spread.

The most common model used to study the spread of COVID-19 is the susceptible-infected-recovered (SIR) model. In such models, there is a susceptible population, which is assumed to be equal to the population of whichever region is being examined minus the number of people that have previously had the disease. Some of the susceptible individuals get infected in each period, where the rate of infection is a function of the number of infectious individuals as well as other factors that shift the rate of transmission. Finally, infectious individuals move to a state of recovery. In our analysis, we call anyone who was sick but is no longer infectious to be “recovered,” although some of these people may still actually be sick, hospitalized, or have died. Thus, the recovered terminology is actually a shorthand for all post-infectious states. This model, and its variants, have been used extensively to study the growth of COVID-19. For example, a recent study estimates a Susceptible-Exposed-Infected-Confirmed-Removed (SEIQR) model, which appends the standard SIR model with a stage modeling susceptible people who become exposed to the virus and a stage modeling infected people which are confirmed to have the disease9. The paper then applies this model to estimate COVID-19 transmission in Wuhan, China, showing that an earlier lockdown makes the outbreak worse in Wuhan but helps the rest of the world. The SEIQR model is also used to show that travel restrictions may have reduced the spread of COVID-19 from Wuhan, China, to other Chinese cities10,11. As another variant to the SIR model, the SEIR model (adding an exposure stage to the SIR model) is also applied to compute the rate of transmission both from animals to people and from people to people12. While it may seem that having more stages in the model would make the SEIR model superior to the SIR model, it has been shown that the standard SIR model does a better job at predicting the spread of COVID-19, based on data from Wuhan, China13.

In this paper, we use a modified version of the SIR model to measure the extent to which social distancing reduces the speed at which COVID-19 spreads. We then run simulations to forecast the rates of COVID-19 spread under different social distancing levels during the reopening. We find that COVID-19 spreads less than proportionately with the number of infectious individuals, a distinct difference from the assumption of standard models. We demonstrate that this pattern could be explained by the interconnectedness of people’s social networks. This pattern suggests that each additional infectious individual has less impact on the disease spread as more people become infected. One key implication of this finding is that the rate of disease growth can be slow and steady, rather than either exponential or falling quickly, as would be implied by the most-commonly used models. This leads to more accurate predictions of the spread of COVID-19. We also observe that social distancing greatly reduces the spread of COVID-19.

Mathematically, we model transmission of COVID-19 as

$$\begin{aligned} y_{i,t}=R_{i,t}S_{i,t}\left( Y_{i,t-2}-Y_{i,t-8}\right) ^{\omega } \end{aligned}$$
(1)

where \(y_{i,t}\) is the number of new infections in county i on date t, \(R_{i,t}\) is the rate at which infectious individuals transmit the disease, \(S_{i,t}\) is the percentage of the county population that has not yet had COVID-19, and \(Y_{i,t}\) is the cumulative number of individuals who have been infected by date t. Correspondingly, the \(Y_{i,t-2}-Y_{i,t-8}\) term reflects our assumption that infected individuals are infectious from the second day after they catch the virus through the seventh day. This implies that the average serial interval is 4.5 days under the assumption that the level of infectiousness and the level of contact with susceptible individuals is constant during this time14. This treatment of the infectious population is an approximation of the standard SIR model, where the infectious population is typically modeled as a stock that has a constant outflow rate. Discretizing the rate of transmission enables the estimation of a large number of county and date fixed effects in our model, and as a practical matter this assumption has little impact on our estimates of the contagion rate. As a robustness check, we obtain extremely similar COVID-19 forecasts if we take the time of infectiousness to be 14 days, \(Y_{i,t-2}-Y_{i,t-16}\), instead of 6 days, as presented in the appendix. The main difference between our model and the standard SIR model is the inclusion of the exponent \(\omega\) on the number of infectious individuals. This \(\omega\) allows the rate of growth of COVID-19 to be less than proportionate with the number of infectious individuals if \(\omega <1\). Such a result would be expected if infectious individuals expose many of the same unexposed individuals, which could occur if people have overlapping social connections. We see this directly when, for example, cases are clustered within households, nursing homes, or places of work. Thus, we can think of \(\omega\) as measuring the extent to which people’s networks are more interconnected to a tight-knit group of individuals relative to their level of connectedness to the population as a whole.

We also allow the transmission rate \(R_{i,t}\) to vary with a number of factors instead of treating it as a constant parameter:

$$\begin{aligned} R_{i,t}=\exp \left( \alpha _{i}+\beta _{t}+\lambda d_{i,t}+\mu m_{i,t}+\theta h_{i,t}+\varepsilon _{i,t}\right) . \end{aligned}$$
(2)

We use \(d_{i,t}\), \(m_{i,t}\), and \(h_{i,t}\) to represent the level of social distancing, temperature, and humidity in county i on date t, respectively, and \(\varepsilon _{i,t}\) is the statistical error term. The parameters \(\alpha\) and \(\beta\) are vectors of county and date fixed effects, where the i-th element of \(\alpha\), \(\alpha _{i}\), represents the fixed effect for county i. Similarly, the t-th element of \(\beta\), \(\beta _{t}\), represents the fixed effect for date t. These fixed effects measure the baseline transmission rate of each county and each date, respectively. The parameters \(\lambda\), \(\mu\), and \(\theta\) measure the impacts of social distancing, temperatures, and humidity on transmission rates, respectively. In short, this specification allows transmission rates to differ across counties (through the county fixed effects), dates (through the date fixed effects), levels of social distancing, temperatures, and humidity. We note that the impacts of the last two factors have been debated in the literature15,16,17. The county fixed effects account for differences in demographics across counties, such as the demographics shown in Table 2 below as well as other unobservable county-specific factors. The date fixed effects account for both day-of-the-week differences in the patterns of travel for people (e.g., the time away from the house to go to work or to go to the park, which may lead to different exposures to the disease) as well as differences in the rate of testing and reporting that occur across time. As a robustness check, we also include the state-level testing numbers directly into Eq. (2) during estimation. The results are statistically indistinguishable from the main results, as noted in “Results and simulation” below.

The social distancing measure, \(d_{i,t}\), is based on cellphone GPS location data that are provided by SafeGraph for free to researchers studying COVID-19. We measure social distancing as the first principle component of several daily measures of each county: the percentage of residents staying home, the percentage of residents working at workplace full-time, the percentage of residents working at workplace part-time, the median duration of residents staying home, and the median distance of residents traveled.

As noted earlier, the most crucial difference between our model and a standard SIR model is that a standard SIR model constrains the exponent \(\omega =1\). We instead find that \(\omega =0.57\). Thus, the marginal impact of one more infected person diminishes as more people are infected. Such a result would be expected if infectious individuals expose many of the same unexposed individuals within a clustered network of individuals. In the appendix we demonstrate that a networking model with contagion can yield \(\omega <1\).

Table 1 Estimation of a modified SIR model.

Results and simulation

The estimated model appears in Table 1, with standard errors (s.e.) reported in the parentheses. The estimated exponent on the number of infectious people, \(\omega\), is 0.57. Thus, the number of new infections is concave with respect to the number of infectious individuals. This level of concavity also implies that while initial outbreaks of COVID-19 expand exponentially, the daily number of new cases quickly stabilizes to a long-term plateau. We also find that social distancing has a large impact on the growth rate of COVID-19, while humidity has a smaller effect and temperature is insignificant. (When including daily testing numbers of each state in Eq. (2), the estimates of \(\omega\) and social distancing are 0.568 (s.e. = 0.014) and − 0.816 (s.e. = 0.246), respectively).

All county-level demographic factors remain constant over time in our analysis. While our main regression gives many insights, impacts of these demographic factors on the spread of the virus are captured by the county fixed effects. In order to better understand how these factors affect the contagion rate, we next regress the county fixed effects \(\alpha\) on several demographic variables of each county. The coefficients from this regression should be thought of as the impacts of these demographics on the transmission rate. The results from this regression are reported in Table 2. We observe that the contagion in the disease is increased with greater population density and the percentage of commuters who use public transportation. We also observe that contagion rates are higher in areas with a higher fraction of Black and Hispanic residents. Furthermore, the rate of spread is higher for seniors than for younger people, but children and non-senior adults do not seem to have statistically significantly different rates of contagion.

Table 2 Analysis of county fixed effects.

We next measure the out-of-sample prediction accuracy of our model using a hold-out sample of 75 days (May 24–August 6) to see how well our model forecasts new cases. We use the observed county level of daily social distancing for our out-of-sample predictions. Nationally, this reflects an approximately 50–60% return-to-normalcy, but this varies quite a bit across the country. We define the percentage return-to-normalcy as \(\frac{SocialDistancingPeak_{i}-SocialDistancing_{i,t}}{SocialDistancingPeak_{i}-SocialDistancingBeforeCOVID_{i}}\), where \(SocialDistancingPeak_{i}\) is the social distancing level in county i at its peak (April 5–April 11, 2020), \(SocialDistancingBeforeCOVID_{i}\) is the observed lowest level of social distancing in February, and \(SocialDistancing_{i,t}\) represents social distancing level on date t. For example, a 25% towards normalcy represents social distancing at the level of 0.25\(\times\)(minimum social distancing) + 0.75\(\times\)(maximum social distancing).

Figure 1 shows the US actual cumulative cases along with out-of-sample forecasts from a model with \(\omega =0.57\) and a standard model with \(\omega =1\). The black hashed line represents the actual cumulative cases in the US. The green solid line and the red dashed-line show the out-of-sample forecasts with \(\omega =0.57\) and \(\omega =1\), respectively. We readily observe that the model with \(\omega =0.57\) fits the data well while the model with \(\omega =1\) does not. Three states that had their Shelter-in-Place orders expire or stuck down early are Florida, Georgia, and Wisconsin. To further evaluate our model’s accuracy in prediction, we repeat the same out-of-sample prediction comparisons for these three states in Figure 2. The figure again shows that the model with \(\omega =0.57\) has a much better fit than the model with \(\omega =1\).

Figure 1
figure 1

Out-of-sample fit comparisons of the US between our model and standard SIR model. The vertical line on May 23, 2020 indicates the last date used to estimate each model.

Figure 2
figure 2

Out-of-sample fit comparisons of Florida, Georgia, and Wisconsin between our model and the standard SIR model. The vertical line on May 23, 2020 indicates the last date used to estimate each model. The vertical lines to the left indicate the expiration date of Shelter-in-Place order in each state.

We next simulate daily and cumulative cases from August 7 to October 31, 2020 under different levels of social distancing. When forecasting future cases, we use previous 5-year county temperature data and the May 2020 county average humidity. The top of Figure 3 shows three sets of forecast daily cases after August 6, corresponding to 75%, current, and 25% levels of return-to-normalcy. We observe that social distancing at the current 60% return-to-normalcy first leads to a slightly increasing but then slowly decreasing number of cases, going from around 55,000 cases per day in early August to 25,000 cases per day in the end of October. If the US practiced social distancing at the level reflecting a 25% return-to-normalcy for even a few weeks, new cases would drop to a much lower level of around 9,000 per day. On the other hand, a return to a 75% level of the normalcy would cause cases to surge for about two months. The pattern of the surge is consistent with recent studies on the relaxation of non-pharmaceutical interventions such as shelter-in-place orders18. However, after two months cases would again reach a long-term plateau, although this would occur at a level that was almost double of what would be experienced under the early-August level of social distancing. The bottom of Figure 3 depicts the corresponding cumulative cases for the same time period under 100%, 75%, current, 25%, and 0% levels of return-to-normalcy. The figure shows a consistent pattern where the cumulative cases look almost linear after the initial take-offs. There would be substantially more cases if we returned to the pre-COVID level of social distancing.

Figure 3
figure 3

Daily and cumulative cases forecasting under different reopening strategies. The vertical line indicates the last date of case data sample.

Methods

In this subsection, we detail the assumptions we make and the estimation procedure. The model is laid out in Eqs. (1) and (2) above. For simplicity, we rewrite Eq. (2) as \(R_{i,t}=\exp \left( X_{i,t}^{\prime }\Phi +\varepsilon _{i,t}\right)\), where \(X_{i,t}\) includes county dummy variables, date dummy variables, the measure of social distancing \(d_{i,t}\), and daily average temperature \(m_{i,t}\) and humidity \(h_{i,t}\). \(\Phi\) is the vector containing the parameters \(\alpha\), \(\beta\), \(\lambda\), \(\mu\), and \(\theta\), which measures the impact of each element in the vector \(X_{i,t}\) on the transmission rate \(R_{i,t}\). We assume that the errors \(\varepsilon _{i,t}\) are uncorrelated across counties. We further assume that \(\varepsilon _{i,t}\) is uncorrelated across time, although we cluster the standard errors by county.

We estimate the model by taking logarithm of both sides. After rearranging we get:

$$\begin{aligned} \left[ \ln \left( y_{i,t}\right) -\ln \left( S_{i,t}\right) \right] =X_{i,t}^{\prime }\Phi +\omega \ln \left( Y_{i,t-2}-Y_{i,t-8}\right) +\varepsilon _{i,t} \end{aligned}$$
(3)

Note that sometimes \(y_{i.t}\), the diagnosed case number, is 0 for some counties on some dates. Therefore, we adjust this formula slightly by adding 1 to \(y_{i,t}\) so the logarithmic values are always well-defined:

$$\begin{aligned} \left[ \ln \left( y_{i,t}+1\right) -\ln \left( S_{i,t}\right) \right] =X_{i,t}^{\prime }\Phi +\omega \ln \left( Y_{i,t-2}-Y_{i,t-8}\right) +\varepsilon _{i,t} \end{aligned}$$
(4)

In some counties, \(Y_{i,t-2}-Y_{i,t-8}\) is 0 for some periods. We do not use those observations for estimation. Note that because this is a lagged variable, this is a selection based on independent variables and not based on dependent variables, and hence it does not bias our estimation.

One concern that can arise in estimating this model is that social distancing levels (and regulations) are not determined in a vacuum: Rather, people social distance more in areas that are hit harder by COVID-19. Thus, \(\varepsilon _{i,t}\) may be correlated with social distancing, causing a biased measurement of the impact of social distancing on the rate of contagion. We thus use an instrumental variables (IV) technique to control for this endogeneity bias, where the amount of rain is our instrument for social distancing. Specifically, we assume that rain directly shifts the level of social distancing, but is not correlated with \(\varepsilon _{i,t}\) conditional on the temperature and humidity. Several other papers have used rain as an instrument for social distancing19,20,21,22. The first-stage F-statistic for the strength of rain as an instrument is 214.44, which is highly significant, indicating that rain is a strong instrument.

Data

Our data come from a multitude of sources. We detail the data sources at https://github.com/songyao21/covid_data_depot. There are a few nonstandard issues to note. Our data on COVID-19 cases consists of county-level, officially confirmed daily case data of 2,924 US counties from February 1 to August 6 (with the last 75 days used as a hold-out sample). COVID-19 also has an incubation period of approximately 5 days23,24. Because of this lag from infection to diagnosis, we assume that cases reported on a particular date actually measure the COVID-19 infections from 5 days earlier. We also assume that the true number of cases is approximately 10 times the number of diagnosed cases. We get this number by assuming that the Infection Fatality Rate (IFR) is 0.75%25. We also assume that any deaths occur 14 days after the confirmed test results. On May 23, 2020, the last day of our estimation case data, there were 92,622 deaths in the US. On May 9, 2020, there were 1,304,726 officially diagnosed cases. We hence obtain the factor as (92,622/0.0075)/1,304,726 = 9.5. We round this number up to 10. This is consistent with Centers for Disease Control and Prevention (CDC) director Robert Redfield’s estimate of the ratio between actual and confirmed cases26. Our estimates are not sensitive to the specific factor we use. When we run the simulations, we divide our model’s predicted case numbers by 10, which gives us the prediction of diagnosed cases.

Conclusion

We use a modified SIR model to study the impacts of different factors on the spread of COVID-19. We find that the impact of each additional infectious individual decreases as more people become infected. A potential mechanism underpinning this finding is that infections are more likely to occur within interconnected networks. Understanding the shape of this relationship, and the nonlinear aspects of it, are important for understanding how COVID-19 spreads. Unlike previously-estimated SIR models, our model allows for the possibility that the contagion process will grow or shrink at relatively steady levels, whereas traditional SIR models have contagion either taking off exponentially (if \(R>1\)) or falling quickly (if \(R<1\)).

We further find that social distancing helps to curb the speed of the spread. Consequently, we need to be cautious of breakouts in networks and maintain a reasonably high level of social distancing during the reopening of the economy. Taking the network effects and social distancing effects together gives more accurate forecasts about the timeline of the disease spread, and the ability to analyze and set policies about when to instate shelter-in-place restrictions or when to allow businesses to be open.