1 Introduction

The world is experiencing an enormous death toll from COVID-19. The number of COVID-19 cases and deaths vary spatially and temporally and can be affected by many factors, some local some global. There is a large body of literature that investigates the key biological, socioeconomic, and environmental factors that might increase the degree of severity of the health outcomes after having contacted COVID-19 (Karmakar et al. 2021; Webb Hooper et al. 2020; Yancy 2020; Abdelzaher et al. 2020; Ali and Islam 2020; Fiasca et al. 2020; Giani et al. 2020). Concerning the environmental factors, it is well established that short- and long-term exposure to air pollution increases the risk of several chronic diseases, including cardiovascular and respiratory diseases, irrespective of COVID-19 (Jiang et al. 2016; Lelieveld and Münzel 2019). We and others (Chakrabarty et al. 2020; Liu et al. 2020; Ogen 2020; Jiang and Xu 2021; Yongjian et al. 2020; Conticini et al. 2020; Comunian et al. 2020) have hypothesized that exposure to air pollution increases the severity of COVID-19 outcomes, because air pollution can affect our immune, respiratory and cardiovascular system. This is a rapidly evolving area of research, see, for example (Bhaskar et al. 2020) for a review of the epidemiological studies on this topic.

In this paper, we introduce a Bayesian spatiotemporal model to estimate the association between long-term exposure to PM2.5 and COVID-19 health outcomes. To address this scientific question, we need to overcome several challenges. These include: 1) a large number of zero counts, especially at the beginning of the pandemic; 2) complex spatiotemporal variation that remained unexplained after having accounted for several measured confounders; and 3) computational feasibility. To overcome the challenges listed above and many others, we introduce a Bayesian model with multivariate spatiotemporal distributions of random effects while accounting for several measured socioeconomic and demographic factors. We modeled the COVID-19 death counts via a zero-inflated negative binomial (ZINB) distribution (Neelon et al. 2019). Since the frequentist approach to fitting the ZINB model is challenging for longitudinal, spatial, and spatiotemporal data, particularly when the model includes multivariate spatial random effects, we have chosen a more tractable Bayesian approach proposed by Neelon et al. (2019).

We apply our Bayesian model to a data set that includes weekly county-level death counts, air pollution levels and many other potential confounders from the USA. We incorporate the spatial and spatiotemporal information into the model by assigning a multivariate intrinsic conditionally autoregressive (ICAR) prior structure (Banerjee et al. 2014) to the random effects. Additional time-fixed effects are also considered. Then, we analyze the effectiveness of the zero-inflated model compared to an ordinary negative binomial model.

Wu et al. (2020) also considered a ZINB model at a county level to investigate the association between long-term exposure to PM2.5 and COVID-19 deaths using an ecological and cross-sectional study design. These authors also considered state-specific random effects to capture variation between states. While Wu et al. (2020) investigated global association by using the data over most of US counties, no spatial dependence nor temporal dependence was assumed in the model, which is the main difference from our modeling. Instead, we decided to analyze spatiotemporal county-level data within regions consisting of multiple states that are geographically connected. This can account for heterogeneous dynamics of the spread of disease across different regions of the USA. For our analysis, we consider four regions: Mid-Atlantic (New Jersey, New York, and Pennsylvania), Pacific (California, Oregon, and Washington), South Atlantic (Florida, Georgia, North Carolina, South Carolina), and Midwest (Iowa, Kansas, Missouri, Nebraska, North Dakota, and South Dakota). The results of our model show an overall positive association between long-term exposure to the (PM2.5) and the mortality from the COVID-19 disease, which matches with other previous studies (Bhaskar et al. 2020).

The rest of the paper is organized as follows: In Sect. 2, we describe and visualize the data set for four regions in the USA. In Sect. 3, we present the methodology, including how we leverage the spatial and spatiotemporal information to estimate the COVID-19 spread. In this section, we also compare different statistical models. In Sect. 4, we present the results by applying the methods to the data set. A simulation study along with more details of the real data analysis is available in the Supplementary document. In Sect. 5, we discuss the results and comment on future directions.

2 Data

2.1 COVID-19 Death Counts

We accessed the data from the repository maintained by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE). We obtained daily number of deaths in each county from March 23, 2020, to August 31, 2020. Please note that the start date of the data source is March 22, 2020.

To investigate different scenarios of temporal and spatial dynamics of the COVID-19 deaths, we consider four regions: Mid-Atlantic (New Jersey, New York, and Pennsylvania), Pacific (California, Oregon, and Washington), South Atlantic (Florida, Georgia, North Carolina, South Carolina), and Midwest (Iowa, Kansas, Missouri, Nebraska, North Dakota, and South Dakota). Instead of considering the ratio of a COVID-19 count to a county-level population size, the county-level population size was considered as an offset variable in the model to remove the obvious effect of the county-level population size on the COVID-19 death counts. For spatiotemporal models, we computed 23 weekly COVID-19 death counts from March 23 to August 31, 2020 by aggregating daily counts from Monday to Sunday without overlapping. For spatial models, which we consider for comparison, we used cumulative counts since the beginning of the study period. For sensitivity analysis, we consider a different starting day of a week to calculate COVID-19 weekly death counts for spatiotemporal models. For spatial models, we consider a different length of aggregating COVID-19 death counts.

2.2 Exposure to Particulate Matter (PM2.5)

We imported the data from the repository where code and data are publicly available for reproducing analyses in “Exposure to air pollution and COVID-19 mortality in the United States: A nationwide cross-sectional study” (Wu et al. 2020). The county-level long-term averaged PM2.5 (\(\mu g/m^3\)) is calculated from an established exposure prediction model ([21], Van Donkelaar et al. 2019). Wu et al. (2020) considered the period of 2000–2016. We extended the period up to 2018 for our analysis. Thus, temporal variations of PM2.5 during our study period for each county were not considered.

2.3 Potential Confounders

The twelve potential risk factors or confounding variables were selected from the previous benchmark study (Wu et al. 2020). The list of variables includes percent of poverty, population density (4 levels by quartiles), median house value (thousand $), median household income (thousand $), percent of owner-occupied housing, percent of Hispanic population, percent of Black population, percent of the adult population with less than a high school education, and percent of the adult population older than age 65, percent of hospital beds per population and percent of smoking. These were collected from the 2000 and 2010 Census (https://www.census.gov) and the 2005–2016 American Community Surveys (https://www.census.gov/programs-surveys/acs/) according to Wu et al. (2020). Note that these variables do not vary temporally but only spatially.

3 Methods

In this paper, we will explore spatiotemporal modeling of the count data for our main analysis. We account for the spatiotemporal nature of the county-level weekly COVID-19 death cases over the regions introduced in Sect. 2.

Depending on the over-dispersion and zero inflation characteristics in the outcome, we consider a negative binomial (NB) and a zero-inflated negative binomial (ZINB) distribution-based modeling approach. We will compare these results with the spatial model results using cumulative counts to provide general understanding of our implications of findings.

3.1 Models

3.1.1 Spatial Negative Binomial Model

Let \(y_{i}\) represent the death counts in county i for a certain week or the aggregated death count for a certain period. We model these cross-sectional spatial count data with a negative binomial distribution (Neelon et al. 2019) specified as,

$$\begin{aligned} y_{i} \sim \mathrm {NB}\left( p_{i}, r\right) \end{aligned}$$
(1)

where \(p_{i}\) represents the success probability in the negative binomial distribution for county i and r controls dispersion of the model since \(var(y_i)=E(y_i)(1+E(y_i)/r)\). We model \(y_{i}\) as a generalized linear mixed model with a logistic link function. We assume D fixed-effect covariates including exposure to PM2.5, population density, age distribution and several socioeconomic variables. County-specific spatial random intercepts, \(b_i\), are introduced to allow spatial dependence among the counties. The model is then defined as:

$$\begin{aligned} \begin{aligned} {\text {logit}}\left( p_{i}\right)&= \theta _{i} = {\text {log}}(x_{oi}) + \beta _0 + \sum \limits _{d=1}^{D} \beta _{d} X_{d} +b_{i}. \\ \end{aligned} \end{aligned}$$
(2)

\(x_{o\,i}\) is a population size of the \(i^{\text {th}}\) county so that \({\text {log}}(x_{o\,i})\) indicates an offset variable. Note that a positive value of \(\beta _d\) associated with a unit change in \(X_d\) accounts for an increase in the expected number of counts. We assign \(\{b_{i}\}\) an intrinsic conditional autoregressive (ICAR) prior (Banerjee et al. 2014), which is specified by the following conditional structure:

$$\begin{aligned} b_{i} \mid b_{(-i)}, \sigma _{b}^2 \sim \mathrm {N} \left( \frac{1}{m_{i}} \sum _{l \in \partial _{i}} b_{l}, \frac{\sigma _{b}^2}{m_{i}} \right) , \end{aligned}$$
(3)

where \(m_i\) is the number of neighbors of the \(i^{\text {th}}\) county, \(\partial _{i}\) is the set of indices for the neighbors of the \(i^{\text {th}}\) county and \(b_{(-i)}\) is the set of random intercepts except the one for the \(i^{\text {th}}\) county. \(\sigma _{b}^2/m_i\) represents the conditional variance given the random intercepts corresponding to the rest of the counties. Note that we assume the first-order neighbor structure. This model is similar to the model considered in Wu et al. (2020), but we consider spatial random intercepts. This model is used to compare with a spatiotemporal model. In the rest of the paper, we will refer to this negative binomial spatial model by SNB.

3.1.2 Spatiotemporal Negative Binomial Model

The NB distribution-based modeling of spatiotemporal count data can be expressed as:

$$\begin{aligned} y_{ij} \sim \mathrm {NB}\left( p_{ij}, r\right) , \end{aligned}$$
(4)

where \(y_{ij}\) and \(p_{ij}\) are the count and the success probability corresponding to the negative binomial distribution for the \(i^{\text {th}}\) county at time \(t_{ij}\), \(j = 1, \ldots , n_i\), respectively. We model \(p_{ij}\) as a generalized linear mixed model with the logistic link function in a following way:

$$\begin{aligned} \begin{aligned} {\text {logit}}\left( p_{ij}\right)&= \theta _{ij} = {\text {log}}(x_{o\,ij}) + \beta _0 + \sum \limits _{m=1}^{M}\beta _{1m}T_m + \sum \limits _{d=1}^{D} \beta _{d+M} X_{d} +b_{i1} + b_{i2} t_{ij}. \\ \end{aligned} \end{aligned}$$
(5)

\(x_{o\,ij}\) is a population size of the \(i^{\text {th}}\) county at the time \(t_{ij}\) so that \({\text {log}}(x_{o\,ij})\) indicates an offset variable. \(\sum \limits _{m=1}^{M}\beta _{1m}T_m\) is a flexible nonlinear time-fixed effect using M cubic B-splines to capture the time trend fully. \(\varvec{\beta } = \left( \beta _0, \beta _1, \beta _2, \ldots \beta _{M+D}\right) ^T\) represents the coefficient vector for the fixed effect covariates. \({\varvec{b}}_i = \left( b_{i1}, b_{i2} \right) ^T\) represents the spatial bivariate random effects for the \(i^{\text {th}}\) county. \(b_{i1}\) is a random intercept and \(b_{i2}\) is a random slope for a linear time trend. In this model, we account for time trend both as a fixed effects and also as a random effects because there might be heterogeneity across counties in the temporal dynamic of COVID-19 weekly deaths that remained unexplained by the time invariant covariates.

We modeled the county-specific random effects vector \({\varvec{b}}_i \), as a bivariate ICAR prior to incorporate spatial dependence in the intercept and slope of the linear county-specific time trends:

$$\begin{aligned} {\varvec{b}}_{i} \mid {\varvec{b}}_{(-i)}, \sigma _{b}^2 \sim \mathrm {N}_{2} \left( \frac{1}{m_{i}} \sum _{l \in \partial _{i}} \varvec{b_{l}}, \frac{1}{m_{i}} \varvec{\Gamma } \right) , \end{aligned}$$
(6)

where \(m_i\) is the number of neighbors for the \(i^{\text {th}}\) county and \(\partial _{i}\) is the set of neighbors for the \(i^{\text {th}}\) county. \(\varvec{\Gamma }/m_i\) is a \(2 \times 2\) conditional covariance matrix of \({\varvec{b}}_i\) given \({\varvec{b}}_{(-i)}\), the random effects for the rest of the counties. We shall refer to this model as STNB in the rest of this paper.

3.1.3 Spatiotemporal Zero-Inflated Negative Binomial Model

In order to explain the zero inflation in the count data across different counties over time, we consider a spatiotemporal ZINB model (Neelon et al. 2019) for \(y_{ij}\) as,

$$\begin{aligned} y_{ij} \sim \left( 1-q_{ij}\right) \mathbbm {1}_{\left( \mathrm {w}_{ij}=0 \wedge y_{ij}=0\right) } + q_{ij} \mathrm {NB}\left( p_{ij}, r\right) \mathbbm {1}_{\left( \mathrm {w}_{ij}=1\right) }, \end{aligned}$$
(7)

where \(q_{ij}\) represents the probability that the \(i^{\text {th}}\) county at time \(t_{ij}\) belongs to the negative binomial component and \(w_{ij}\) represents the corresponding indicator variable. \(\wedge \) is a symbol of logical conjunction. We can interpret \(q_{ij}\) as the probability that the \(i^{\text {th}}\) county at time \(t_{ij}\) potentially can have death counts. The rest of the parameters are same as defined in (4). To consider both a population size offset and nonlinear time-fixed effect similar to the model in (5), we model \(q_{ij}\) and \(p_{ij}\) as,

$$\begin{aligned} \begin{aligned} {\text {logit}}\left( q_{ij}\right)&= {\text {logit}} \left[ {\text {Pr}}\left( \mathrm {w}_{ij}=1 \mid \varvec{\beta }_{1}, {\varvec{b}}_{1i}\right) \right] \\&= \theta _{1 ij} = {\text {log}}(x_{oij}) + \beta _{10} + \sum \limits _{m=1}^{M}\beta _{1m}T_m + \sum \limits _{d=1}^{D_1} \beta _{1(d+M)} Z_{d} + b_{1i1} + b_{1i2} t_{ij} ,\\ {\text {logit}}\left( p_{ij}\right)&= \theta _{2 ij} = {\text {log}}(x_{oij}) + \beta _{20} + \sum \limits _{m=1}^{M}\beta _{2m}T_m + \sum \limits _{d=1}^{D_2} \beta _{2(d+M)} X_{d} + b_{2i1} + b_{2i2} t_{ij} \\ \end{aligned} \end{aligned}$$
(8)

where \({\varvec{b}}_{1i} = \left( b_{1i1}, b_{1i2} \right) ^T\) and \({\varvec{b}}_{2i} = \left( b_{2i1}, b_{2i2} \right) ^T\) represent the random effects corresponding to \(q_{ij}\) and \(p_{ij}\), respectively. We impose a multivariate ICAR prior structure (Neelon et al. 2019) on \(\varvec{\phi }_i = \left( {\varvec{b}}_{1i}^T, {\varvec{b}}_{2i}^T \right) ^T\) as,

$$\begin{aligned} \varvec{\phi }_{i} \mid \varvec{\phi }_{(-i)}, \varvec{\Gamma } \sim \mathrm {N}_{4} \left( \frac{1}{m_{i}} \sum _{l \in \partial _{i}} \varvec{\phi }_{l}, \frac{1}{m_{i}} \varvec{\Gamma }\right) , \end{aligned}$$
(9)

where \(\varvec{\Gamma }/m_i\) is a \(4 \times 4\) conditional covariance matrix. This multivariate ICAR allows spatial dependence and county-specific random time trend within count components and excess-zero components as well as dependence between them. This model will be referred to as STZINB-NLT in the rest of this paper, where NLT refers to nonlinear fixed-time trend for \(q_{ij}\).

We hypothesize that the assumption of nonlinear fixed time trend for \(q_{ij}\) might not be necessary in the sense of model parsimony. Thus, we consider a simpler version by assuming linear time fixed effect in a binary component. That is,

$$\begin{aligned} \begin{aligned} {\text {logit}}\left( q_{ij}\right)&= \theta _{1 ij} = {\text {log}}(x_{oij}) + \beta _{10} + \beta _{11} t_{ij} + \sum \limits _{d=1}^{D_1} \beta _{1(d+1)} Z_{d} + b_{1i1} + b_{1i2} t_{ij}. \end{aligned} \end{aligned}$$
(10)

This model will be referred to as STZINB-LT in the rest of this paper, where LT refers to linear fixed time trend for \(q_{ij}\).

3.2 Bayesian Inference

3.2.1 Prior Specification and MCMC Settings

We illustrate prior and hyper-parameter specifications and Markov Chain Monte Carlo (MCMC) settings under the STZINB model framework. For the latent at-risk indicators \(w_{ij}\), probability was given as \(\exp (\theta _{1ij})/[1+\exp (\theta _{1ij})]\), where \(\theta _{1ij}\) is defined as either equation (8) or (10). Prior distributions for \(\varvec{\beta }_1\) and \(\varvec{\beta }_2\) were assumed to be \(N_p\left( \beta _0 = \varvec{0}, \Sigma _0 = 100{\mathcal {I}}_p\right) \), respectively, where \({\mathcal {I}}_p\) is a \(p \times p\) identity matrix. For r in negative binomial component, a uniform prior was considered. To construct time-basis functions \(T_m, m=1,\cdots ,M\), we standardized time points \(t_{ij}\) to be ranged from 0 to 1, i.e., \(t_{ij} = \frac{j}{J}\), where \(J=23\) is the number of weeks during the study period, and \(j =1,\cdots ,J=23\) is a week indicator. We set three internal knot points 0.25, 0.50, 0.75 considering 0 and 1 as boundaries, so that \(M=7\) throughout our analysis. DIC is calculated as described in Gelman et al. (2013) for model comparison.

For each model, we ran three MCMC chains with 11,000 iterations, and 1,000 burn-in. For each model, convergence of each model was determined by conventional MCMC diagnostics such as trace plots and Geweke z-statistics.

3.2.2 Conditional Posterior Distribution and Model Fitting

A posterior sampling algorithm of STZINB is adopted from Neelon et al. (2019), and it is straightforward to implement the algorithms for the other models as they are simpler models. As outlined in Neelon et al. (2019), we need to update at-risk indicators \(\varvec{w}\), coefficients for the binary model component \(\varvec{\beta }_1\), coefficients for the count model component \(\varvec{\beta }_2\), a dispersion parameter r for the negative binomial distribution, the set of spatial random effects \(\varvec{\phi }\) with \(\Gamma \). The following illustrate the steps of MCMC to update the parameters.

STEP1:

Update at-risk indicators \(\varvec{w}\)

Given current parameter values, we draw \(w_{ij}\) from a Bernoulli distribution with probability \(\eta _{ij}\) such that

$$\begin{aligned} \eta _{ij}= & {} \frac{{\text {Pr}}(y_{ij}=0|w_{ij}=1){\text {Pr}}(w_{ij}=1)}{{\text {Pr}}(y_{ij}=0|w_{ij}=1){\text {Pr}}(w_{ij}=1)+{\text {Pr}}(y_{ij}=0|w_{ij}=0){\text {Pr}}(w_{ij}=0)}\nonumber \\= & {} \frac{q_{ij}^r(1-p_{ij})}{q_{ij}^r(1-p_{ij})+p_{ij}} \end{aligned}$$
(11)

where \(q_{ij}\) and \(p_{ij}\) are defined in (8). Note that \(q_{ij}\) is the inverse logit of \(\theta _{1ij}\), and \(p_{ij}\) is the inverse logit of \(\theta _{2ij}\). To avoid numerical issue, we adjusted the sampled \(q_{ij}\) and \(p_{ij}\) to be within (0.001, 0.999), respectively, in practice.

STEP2:

Update \(\varvec{\beta }= \left( \varvec{\beta }_1, \varvec{\beta }_2\right) \)

To update \(\varvec{\beta }_1\), we draw a latent variable \(\xi _{1ij}\) from a Pólya-Gamma distribution \({\text {PG}}(1,\theta _{1ij})\) as shown in Polson et al. (2013). Given \(\varvec{w}\) and \(\varvec{\xi }_1\), the full conditional distribution of \(\varvec{\beta }_1\) is

$$\begin{aligned} \begin{aligned} {\text {Pr}}\left( \varvec{\beta }_1 | \varvec{w}, \varvec{\xi }_1 \right) \propto \pi (\varvec{\beta }_1) \exp \left[ -\frac{1}{2}(\varvec{z}_1-\varvec{X} \varvec{\beta }_1)^T\Omega _1(\varvec{z}_1-\varvec{X} \varvec{\beta }_1)\right] \end{aligned} \end{aligned}$$
(12)

where X is a \(n \times p\) design matrix, \(\pi (\varvec{\beta }_1)\) the prior distribution \(N_p\left( \beta _0, \Sigma _0\right) \), \(\varvec{z}_1 = \frac{\varvec{w}-1/2}{\varvec{\xi }_1}\), and \(\Omega _1 = {\text {diag}}(\varvec{\xi }_1)\) an \(n \times n\) precision matrix. Conditional on \(\varvec{z}_1\), we update \(\varvec{\beta }_1\) from \(N_p\left( \mu , \Sigma \right) \) where \(\Sigma = \left( \Sigma _0^{-1} + X^T\Omega _1 X \right) ^{-1}\), and \(\mu = \Sigma \left( \Sigma _0^{-1}\beta _0 + X^T\Omega _1 \varvec{z}_1\right) \). We update \(\varvec{\beta }_2\) by similar process using the corresponding Pólya-Gamma distribution \({\text {PG}}(y_{ij}+r,\theta _{2ij})\) as shown in Pillow and Scott (2012).

STEP3:

Update r

We can use a random-walk Metropolis–Hastings step illustrated in Neelon et al. (2019). We present Metropolis–Hastings method with uniform prior because of efficiency in computation time.

STEP4:

Update \(\varvec{\phi }\), \(\Gamma \)

Let \(\varvec{\phi }_{11} = (b_{111},\cdots ,b_{1n1})^{T}\) be the \(n \times 1\) vectors of random intercepts for the binary component, \(\varvec{\phi }_{12} = (b_{112},\cdots ,b_{1n2})^{T}\) be the \(n \times 1\) vectors of random slopes for the binary component, \(\varvec{\phi }_{21} = (b_{211},\cdots ,b_{2n1})^{T}\) be the \(n \times 1\) vectors of random intercepts for the count component, and \(\varvec{\phi }_{22} = (b_{212},\cdots ,b_{2n2})^{T}\) be the \(n \times 1\) vectors of random slopes for the binary component. Then, \(\varvec{\phi } = (\varvec{\phi }_{11},\varvec{\phi }_{12},\varvec{\phi }_{21},\varvec{\phi }_{22})^{T}\) is the \(4n \times 1\) collection of all random effects by definition. Under the STZINB-NLT model illustrated in Sect. (3.1.3), the conditional prior for \(\varvec{\phi }_{11}\), for instance, is

$$\begin{aligned} \begin{aligned} {\text {Pr}}(\varvec{\phi }_{11}|\varvec{\phi }_{12},\varvec{\phi }_{21},\varvec{\phi }_{22},\Gamma ) \propto \exp \left[ -\frac{1}{2}(\varvec{\phi }_{11}-\varvec{\mu }_{11})^T\Sigma _{11}(\varvec{\phi }_{11}-\varvec{\mu }_{11})\right] \end{aligned} \end{aligned}$$
(13)

where \(\Sigma _{11} = \left[ \Gamma _{11} - \Gamma _{1,-1}\Gamma ^{-1}_{-1,-1}\Gamma _{-1,1}\right] ^{-1}Q\), \(\varvec{\mu }_{11} = \left[ \left( \Gamma _{1,-1}\Gamma ^{-1}_{-1,-1}\right) \otimes I_n \right] \varvec{\phi }_{(-1)}\), \(\Gamma _{1,-1}\), \(\Gamma _11\) denotes the first element of \(\Gamma \), \(\Gamma _{(1,-1)}\) is the \(1 \times 3\) vector comprising the first row of \(\Gamma \) with element 1 removed, \(\Gamma _{(-1,-1)}\) is the \(3 \times 3\) sub-matrix of \(\Gamma \) after removing row 1 and column 1, \(Q = M - A\), \(M={\text {diag}}(m_1,\cdots ,m_n)\) an \(n \times n\) matrix with diagonal elements equal to the number of neighbors for each spatial unit, A is an \(n \times n\) adjacency matrix with \(a_{ii}=0\), \(a_{il}=1\) if county units i and l are neighbors, and \(a_{il}=0\) otherwise. We update each vector of \(\varvec{\phi } = (\varvec{\phi }_{11},\varvec{\phi }_{12},\varvec{\phi }_{21},\varvec{\phi }_{22})^{T}\) from its normal full conditional distribution based on (13) applying sum-to-zero constraints as needed. To update \(\Gamma \), we use its conjugate prior, an inverse-Wishart full conditional distribution.

4 Results

4.1 Overall Description

Table 1 Summary statistics for four regions: Mid-Atlantic (New Jersey, New York, and Pennsylvania), Pacific (California, Oregon, and Washington), South Atlantic (Florida, Georgia, North Carolina, South Carolina), and Midwest (Iowa, Kansas, Missouri, Nebraska, North Dakota, and South Dakota)

Table 1 shows summary statistics of our data for the four regions in the USA: Mid-Atlantic (New Jersey, New York, and Pennsylvania), Pacific (California, Oregon, and Washington), South Atlantic (Florida, Georgia, North Carolina, South Carolina), and Midwest (Iowa, Kansas, Missouri, Nebraska, North Dakota, and South Dakota). For example, 55.0 for Mid-Atlantic in the row of the average of zero death proportions means on average 55.0% of counties in Mid-Atlantic region have zero deaths during the 23 weeks. In the row of the average of weekly death counts, 16.4 for Mid-Atlantic means on average 16.4 deaths for each county during a week.

The sample variance of COVID-19 weekly death counts exceeds the sample mean for all the four regions in Table 1. This indicates that the negative binomial distribution, which can accommodate overdispersion, would be suitable in modeling COVID-19 weekly death counts. High proportion of zero counts may not be fully explained by the negative binomial distribution alone. Therefore, zero-inflated negative binomial (ZINB) models could be suitable to account for the excess zero counts as well as the overdispersion. This modeling allows that the observed zero would come from two different sources: structural zero and zero from the negative binomial component.

Fig. 1
figure 1

Bar plots of weekly COVID-19 death counts during the study period (2020/03/23–2020/08/31). In the first row, the height of each bar represents the proportion of counties with no death count (0–100%) in each week. A red dashed line is the global average of zero rates across time (week). In the second row, the height of each bar represents the weekly death counts averaged across all counties with regions. Please note that the ranges of y-axis are different among the four regions

The four regions have different characteristics of COVID-19 death counts until August. Figure 1 illustrates two types of bar plots representing the weekly rates of zeros and the weekly death counts averaged across counties within regions. Midwest has the highest zero rates over time. Pacific and South Atlantic regions have decreasing zero rates over time in general. The bar plot of the weekly death counts for both regions shows bimodal shapes with a large peak in early August. On the other hand, Midwest has bimodal shape but the peak is in early May. Different from the other three regions, Mid-Atlantic is left-skewed with a peak in April.

Table 2 Average and standard deviation (SD) of 23 correlation values calculated between logarithm of COVID-19 weekly death counts per capita and PM2.5 over counties in each week for four regions. A zero count is replaced with 0.5 when calculating correlation. The row of percentage of significance provides the percentage of significant correlation values (different from zero) based on a t-test with a 5% significance level

To investigate possible association between COVID-19 death counts and PM2.5, we compute correlation between log of COVID-19 weekly death counts per capita and PM2.5 over counties in each region for each week. With 23 weeks of consideration, we have 23 correlation values for each region. Note that all correlation values are positive. In Table 2, we provide average and standard deviation of these 23 values for each region. Also, we provide the percentage of significant correlation based on a t-test for nonzero correlation with 5% significance level. Majority of weeks has statistically significant nonzero correlation values except South Atlantic. From this investigation, we hypothesized that the COVID-19 counts and long-term PM2.5 exposure were positively associated.

4.2 Estimation Results of the Models on COVID-19 Mortality

Table 3 Point estimates and 95% credible intervals of the fixed effects for the expected COVID-19 death counts of the model chosen by DIC in each region. Results are obtained by fitting a STZINB-NLT given in (8) for Mid-Atlantic and Pacific and by fitting a STZINB-LT given in (10) for South Atlantic and Midwest. The bold numbers indicate statistical significance

Among the Bayesian spatiotemporal models such as STNB, STZINB-LT, and STZINB-NLT, one model that shows the lowest Deviance information criterion (DIC) is chosen for each of the four regions. STZINB-NLT is chosen for Mid-Atlantic and Pacific regions, while STZINB-LT is chosen for South-Atlantic and Midwest regions. Note that we include a nonzero-inflated model for comparison and the zero-inflated models were selected for all the regions we considered in this study with given study period. The DIC values for the models we considered are provided in Tables 4, 6, 8 and 10 in the Supplementary document.

Table 3 describes the estimated coefficients and their 95% credible intervals of the covariates on COVID-19 weekly death counts under the selected model. The results for the four regions indicate that long-term exposure to PM2.5 is positively associated with the expected COVID-19 weekly death counts, but only Pacific region shows lack of significance based on the 95% credible interval. The point estimate of \(\beta _{2,2}\) that corresponds to PM2.5 for the Mid-Atlantic region is equal to 0.069, which is the change in the expected COVID-19 weekly death counts in log scale by a unit change in PM2.5. That is, an increase of 1 \(\mu g/m^3\) in the long-term PM2.5 level is associated with a \(e^{0.069} - 1 \simeq 7.1\%\) increase in the expected COVID-19 weekly death counts per county after controlling all the confounding factors. Similarly, we have 2.3% for Pacific, 3.1% for South Atlantic and 9.2% for Midwest of increments in the expected COVID-19 weekly death counts per county.

Increases in MHI, Hispanic population and number of beds are associated with the increases in the expected COVID-19 death counts, although some are not significant based on 95% credible intervals. MHV is negatively associated except Mid-Atlantic region, while Black population and Smoking rate are mostly positively associated except Pacific region for Black population and South Atlantic for smoking rate. The other confounders show mixed directions of the effects.

Table 4 Point estimates and 95% credible intervals of the fixed effects on \(q_{ij}\), (the probability that belongs to the negative binomial component). Results are obtained by fitting a STZINB-NLT given in (8) for Mid-Atlantic and Pacific and by fitting a STZINB-LT given in (10) for South-Atlantic and Midwest. The bold numbers indicate statistical significance

Table 4 illustrates the estimated coefficients and their 95% credible intervals of the covariates on \(q_{ij}\), the probability that belongs to the negative binomial component under the selected model. The estimated coefficients for long-term ambient PM2.5 are positive for all regions except for Pacific region. This implies that the increase in the level of long-term ambient PM2.5 can potentially increase the chance of COVID-19 death since \(q_{ij}\) is the probability that belongs to the negative binomial component which models nonzero death counts. The effect of the long-term ambient PM2.5 on \(q_{ij}\) is statistically significant for Mid-Atlantic and Midwest regions. If the long-term ambient PM2.5 increases in 1 \(\mu g/m^3\) adjusting the other confounding factors in Midwest region, for example, the log odds ratio for \(q_{ij}\) increases by 0.355. This implies that if the long-term ambient PM2.5 changes from 8 \(\mu g/m^3\) to 9 \(\mu g/m^3\), \(q_{ij}\) will increase in \(\exp (9\cdot 0.355)/[1+\exp (9\cdot 0.355)] - \exp (8\cdot 0.355)/[1+\exp (8\cdot 0.355)] \simeq 0.016\). On the last week of the study period, \(q_{ij}\) in Midwest region ranges from 0.164 (Jewell County) to 0.999 (Johnson County).

The most estimates for MHI, Hispanic population and number of beds are positive, which are similar to the results in Table 3, but they are mostly not significant. For the other confounders, the directions of the effects for the confounders are rather different and compared to the results in Table 3, and the effects are mostly not significant.

Fig. 2
figure 2

Visualization of time-averaged \(q_{ij}\) using the selected models for the four regions. The first two rows show the time-averaged \(q_{ij}\) over counties. The last row show twelve counties whose estimated \(q_{ij}\)s correspond to the \(\frac{j}{11}\)th quantiles for \(j=0,1,\cdots , 11\)

Figure 2 shows the time-averaged \(q_{ij}\). Note that \(q_{ij}\) represents the probability of the \(i^{th}\) county being in the negative binomial component at the j-th week. The dark colored counties for each region indicate a high time-averaged probability in the first row. In the second row, we show 12 counties whose estimated \(q_{ij}\)s correspond to the \(\frac{j}{11}\)th quantiles for \(j=0,1,\cdots , 11\). The differences between the first and the second counties are relatively larger in Mid-Atlantic and Pacific regions than those in the other two regions. The decreasing rate over counties is slower in South Atlantic and Midwest regions compared to the other regions.

The estimated \(\Gamma \), the covariance matrix of four spatial random effects in each region shows that off-diagonal entries are close to zero (Table 3 in the Supplementary document). Recall that these random effects are random intercepts and random slopes for time in the count component and excess zero component. Thus, estimated zero implies these random effects are not correlated to each other, although each random effect is spatially dependent.

Fig. 3
figure 3

The median value of estimated nonlinear mixed time effect in the negative binomial component (the COVID-19 death counts) using the selected model structure. The x-axis represents timeline (weeks). We show five representative counties in each region for visualization. A red dash line is a zero reference. The ranges of y-axis are different among the four regions (Color figure online)

Figure 3 shows the estimated nonlinear mixed time effects in the negative binomial component (the COVID-19 weekly death counts). Each of the four regions shows different nonlinear temporal patterns in the negative binomial component. We selected five representative counties out of twelve appeared in Fig. 2 considering the rank of the estimated \(q_{ij}\), and they show distinctive patterns over study periods since we allow county-specific random effects in the intercept and linear components. The temporal patterns are overall similar to the patterns we observed in Fig. 1. This is expected since the covariates in the models are static so that the time effect components in the models try to capture the temporal patterns unexplained by the static covariates. The estimated effects of the models were not much sensitive by a different starting day of a week when aggregating COVID-19 daily death counts to weekly death counts (results not shown).

4.3 Comparison by Modeling Techniques

In this section, we compare our spatiotemporal models with a spatial negative binomial model (SNB) with cumulative death counts. The results are provided in Table 5. Since the models are different as well as the response variables are different, we cannot directly compare the results between the SNB model and spatiotemporal models. However, we can assess the direction of associations and whether they are consistent or not between these models. Under the SNB model, we found that the long-term exposure to PM2.5 is positively associated with the expected cumulative death counts for COVID-19 for all four regions. This is consistent with the results from the spatiotemporal models provided in the previous subsection and also with the results by Wu et al. (2020). This supports the hypothesis of positive association between the long-term ambient PM2.5 and the expected COVID-19 death counts. On the other hand, the results are different in some aspects. Note that the effect size of long-term ambient PM2.5 estimated from the spatial model exceeds those from spatiotemporal models. We suspect that this could be due to the lack of temporal components in the model. Also, the directions of the association for some confounders are not the same and there are less number of significant effects.

The estimated effects are rather sensitive to the length of the period for aggregating the outcome variable (results not shown), which could be an issue to consider a SNB model for COVID-19 death counts. In addition to this issue, a SNB model ignores temporal characteristics of the data. Thus, it is natural to consider a spatiotemporal model, but we should be careful in interpreting the results from the spatiotemporal models since the model complexity can result in an overfitting or unstable estimation.

Table 5 The estimation results of effects on the cumulative death counts of COVID-19 using the spatial negative binomial model (SNB model; Sect. 3.1.1) for the four regions (Mid-Atlantic, Pacific, South Atlantic, and Midwest). The counts are aggregated from March 23, 2020 to August 31, 2020. The bold numbers indicate statistical significance

5 Conclusion and Discussion

We investigate the relationship between long-term exposure to PM2.5 and county-level COVID-19 weekly death counts by implementing and comparing several spatiotemporal negative binomial models with/without a zero-inflated component. These associations were adjusted by social and environmental factors. We also considered county-level random effects that account for spatiotemporal interaction in both the structural zero component and the negative binomial component via an ICAR model. We considered possible nonlinear time effects in both components as well.

We hypothesize potential heterogeneity in the effects of long-term exposure to PM2.5 and other social and environmental factors on the COVID-19 weekly death counts across divisions of the USA, likely due to different region-specific sociocultural, behavioral and healthcare system as well as COVID-19 policies by assuming that nearby states have similar characteristics. Thus, we consider four geographically different regions (Mid-Atlantic, Pacific, South Atlantic, and Midwest) and applied the spatiotemporal models to each region.

Based on model comparison by DIC, we selected zero-inflated models for all four regions. Within zero-inflated models, the linear time trend model for the probability that belongs to the negative binomial component was chosen for South Atlantic and Midwest regions, while the nonlinear time trend model was chosen for the other two regions. Note that we assumed nonlinear time trend for the negative binomial component for all four regions.

We compare the results obtained from these spatiotemporal models with the results obtained from a spatial negative binomial model that completely ignores the temporal information. The spatial negative binomial model was applied to the cumulative death counts until the date we considered (August 31, 2020). Because the spatiotemporal model uses weekly COVID-19 counts as outcome and the spatial model uses cumulative death counts for the whole study period as outcome, the results obtained under the two models have different interpretations. Still we found that the direction and the strength of the associations between PM2.5 and COVID-19 death counts are consistent.

The estimated coefficients associated with long-term exposure to PM2.5 from the selected models are mostly positive and statistically significant for the regions under this study after adjusting the nonlinear time trend with a county-specific random slope, spatial dependence, and many other measured confounders. The directions of association for COVID-19 weekly death counts, although not significant for Pacific, are consistent with the result of the previous study (Wu et al. 2020). Note that Wu et al. (2020) did not consider spatial and temporal dependence in the model. We also checked the effects of long-term exposure to PM2.5 for all the models introduced in Sect. 3.1 and the model from Wu et al. (2020) using the data used in this study. All the models show the positive association. The results are given in Tables 5, 7, 9 and 11 in the Supplementary document.

Some of confounders may affect on COVID-19 weekly death counts through an interaction effect. Thus, we investigate effects by an interaction term between PM2.5 and other confounders, but they are not critical for the models and data sets we considered. These findings add evidence of the increase in risk of death for COVID-19 by the long-term exposure to air pollution into the literature.

By aggregating the data over time or region, we may encounter ecological fallacy and need to be careful in interpreting the result. That is, the increased effects on COVID-19 death counts by PM2.5 that we observed from our analysis may not imply an increased risk of individual’s death by COVID-19 since the analysis is based on a county-level weekly aggregated death counts. Our results only imply that long-term exposure to PM2.5 may increase COVID-19 weekly death counts at a county level.

By applying the models to each region, separately, we are able to see different association patterns among regions. Although the effect of the long-term exposure to PM2.5 on the COVID-19 weekly death counts is in the same direction for all four regions, the size of the effects is different. Also, the effects of the some other confounders show different directions by region. The proposed models, spatiotemporal zero-inflated negative binomial models with nonlinear time effects, spatial random effects and spatiotemporal interaction random effects, are very flexible and capture the spatiotemporal characteristics of the data. On the other hand, the complexity of the model could lead to an increased variability in estimation due to the large number of parameters to estimate. This issue could be alleviated by controlling the prior distribution with the information from the previous study.

To support our claim about our methodology and modeling strategy, we have done extensive simulation studies which mimics our specific spatiotemporal data structure with nonlinear temporal effects. Full details of our simulation studies are included in the Supplementary document. We have developed a user-friendly R tool. All model-related R codes and software can be downloaded from GitHub repository (https://github.com/junpeea). We are also in the process of developing a R-Shiny-based application for cloud-based deployment and interactive interface for non-statistician’s easy use and access.

As the spread of COVID-19 is ongoing, geographically different regions have different dynamics of the disease spread and a spatiotemporal model is flexible enough to handle different types of dynamics. As the surge of COVID-19, the zero-inflated model might not be suitable for many states anymore. However, by investigating and comparing several spatiotemporal models including nonzero-inflated models, we can find a reasonable model that explains the characteristics of the data. The risk factor and confounders we used are not temporally varying although the response variable, COVID-19 weekly death counts, is varying over time. Temporal variations in the response variable would be due to unavailable temporal covariates such as policy changes by county health department on COVID-19 and policy changes for school opening and business operation. To handle temporal variations without such temporal covariates, we introduced nonlinear time effects with county-specific random slopes. Once this additional information is available, we can easily accommodate it into the model.

One can consider applying the spatiotemporal model to county-level COVID-19 weekly death counts for the entire USA. This can be doable but a single model may not be able to capture heterogeneous effects that we found in this study. Also, handling a spatial dependence model with a large number of spatial regions brings an additional computational burden. On the other hand, we can extend the current model with a spatiotemporally varying coefficient model to capture heterogeneous effects by states or by county so that we can investigate the whole data into one model framework. A modified spatial dependence modeling such as allowing spatial dependence only within state to increase computational efficiency can be also considered. We plan to investigate such extension as a future study.

The effect of PM2.5 on individuals’ health has been investigated in both long-term and short-term directions in the literature. COVID-19 changed economic environment as well as people’s life styles in many ways, which cause large variations (either up or down) in PM levels during a short-term period (Wu et al. 2020; Venter et al. 2020) As our focus is long-term effect of PM2.5 on COVID-19 death counts, these short-term changes of PM levels were not considered in our analysis. Thus, our findings are restricted to the association between long exposure to PM2.5 and COVID-19 death counts, in particular, aggregated death counts over counties and weeks.

Finally, as we have mentioned earlier, this study contributes toward the possibility of the hazards of PM2.5 on COVID-19-positive cases and related mortality. Specifically, we believe that as we have incorporated the “zero-inflation” part to the model, it will allow to make inference in early stages of the pandemic of this kind. Nevertheless, we assent that the claim on the higher risk of a COVID-19 death in polluted counties is open to debate as our findings (positive association) do not imply causation. It is also acknowledged that the findings in this paper are limited as they depend on several factors such as the data structure, models and inference methods. There have been about only two years since the coronavirus outbreak began; even so a heavy contribution from several genre of research related to this pandemic—which is a great reflection of the fact that researchers around the world are trying to infer on the connection between the air pollution and its association with the deaths related to COVID-19. However, we are in a urgent need of more research to be done, at the granular level with more complexity in the data and in the modeling. As we are working on this problem, as a statistician and data scientist, it is believed that there is an immediate need of more patient-level data (which is not easy to access because of the issues related to data confidentiality, data sharing, and other related things) than the kind of data publicly available these days. That also makes us believe that as more similar findings are accumulated by other researchers, our claim would become much more stronger.