1 Introduction

In the last decade, data with temporal and spatial attributes are quickly accumulated and form large numbers of spatio-temporal datasets collected in diverse domains, including climate science, social sciences, economics, neuroscience, epidemiology, transportation, mobile health, and Earth sciences. In these domains, the real-world processes being studied are intrinsically spatio-temporal, and several data acquisition methodologies have been proposed to record the spatial and temporal information of every measurement in the data. For example, when dealing with environmental data generated by air quality monitoring sensors, data are simultaneously collected by monitoring stations for different sites and different time points (see Atluri et al. 2018 for a review).

In all these fields, missing data are pervasive as it often happens that, for several reasons including equipment failure or measurement errors, (possibly) long sequences of data are not correctly recorded. If these holes in data cannot be accurately estimated, the subsequent steps of data analysis and modelling might lead to incorrect results and unreasonable inference. Clearly, merely deleting the records containing missing data cannot be considered a sensible strategy since it would lead to a significant loss of initial information and would be a waste of data resources. Therefore, statistical methods which are able to accurately and efficiently interpolate missing values have been proposed in the literature. Clearly, in the context of spatio-temporal data, the problem of missing data reconstruction becomes even more challenging, given the additional complexity of the spatial and time-dependent structure, which is present in the observed datasets. However, the spatial and time dependency, when correctly modeled, can be effectively exploited to get accurate reconstruction of (possibly) long sequences of missing values.

Studies on the missing values estimation in meteorological time series go back to the 1950s. In the first studies, missing values are estimated by means of imputation or simple linear regression estimates. For the same purpose, in Young (1992) three methods, normal ratio method, multiple discriminant analysis and multiple linear regression are discussed and compared. In Eischeid et al. (2000) several estimation techniques based on spatial analysis schemes are evaluated to create a complete national daily time series of the maximum-minimum temperatures and total precipitation over the western United States. In Teegavarapu et al. (2005) an imputation method based on the inverse distance weighting method, by looking at the distance between target and reference stations, was proposed. The use of a regularized EM algorithm for the imputation of incomplete climate data is recommended in Schneider (2001), especially when the number of the observed series with missing values exceeds the sample size. Another study that proposes multiple imputation is provided in Cano et al. (2010), in which a MCMC-based procedure is suggested. Other appreciable works in the area of metereological time series are carried out by Junninen et al. (2004); Lo Presti et al. (2010); Smith et al. (2007).

Recently, some new approaches have been proposed in the literature, with clear evidence of substantial advantages compared with existing methods (see, Pollice et al. 2009; Liu et al. 2014; Yang et al. 2018 inter alia). Mainly, the new approaches combine two different imputation methods in separate stages (for example, k-NN and Fourier transform), so that the first one accounts for cross-correlation among variables and the second one deals with serial correlation in univariate time series. Recently, in Calculli et al. (2015) the authors proposed a multivariate hidden dynamic geostatistical model and maximum likelihood parameter estimates obtained by using EM algorithm. This approach is able to deal with multiple variables sampled at different monitoring networks and missing data. In Parrella et al. (2019) we proposed a new procedure for estimating (even long) missing sequences in time series, which uses an approach based on the generalized spatial-dynamic autoregressive model. This model was first proposed in Dou et al. (2016) and belongs to the family of spatial econometric models (see Lee et al. 2010 for an introduction and a survey of such models). These models include, in the form of a weighted multivariate autoregression, the distances among the considered locations (i.e., among the monitoring stations). In this way, it is possible to take into account spatial correlation in the data and estimate missing sequences in one given spatial site by looking at near sites, but also by looking at the previous lags of the same station and the neighbour sites.

However, when estimating the missing values for a given site, a point forecast alone is usually not sufficient. A statement about the uncertainty contained in the point forecast, as expressed by some prediction intervals, may also be desired. Moreover, missing values are not only isolated missing points, but often there are (possibly long) missing sequences of data points in spatio-temporal databases. Hence, we usually have a path of missing values for H consecutive time points, for a given site. A path-missing reconstruction refers to the sequence of corresponding missing values imputation for the H missing time points.

On the one hand, one can construct H marginal prediction intervals by using a given method to build a prediction interval repeatedly, one period at a time. But, by design, probability statements then only apply marginally, one period at a time: the prediction interval at a specific time point \(t_{h}\), for some \(t_{1} \le t_{h} \le t_{H}\), will contain the random variable representing the missing value with prespecified probability \(1-\alpha \). The problem has been already addressed in the literature with effective proposals as in Alonso et al. (2008) and Alonso et al. (2013) where a bootstrap scheme is employed to construct accurate interpolation intervals.

On the other hand, a more general problem is the construction of a joint prediction region (JPR) that will contain the entire missing path with the desired probability \(1-\alpha \). Clearly, stringing together marginal prediction intervals for time points \(t_{1}\) up to \(t_{H}\), each one at level \(1-\alpha \), will not result in a JPR that contains the entire missing path with probability \(1-\alpha \). Instead, like the case of prediction intervals, apart from pathological cases, the joint coverage probability of the missing sequence will be strictly less than \(1-\alpha \), and decreasing in H. Despite the importance of the problem just described, the construction of JPRs for missing paths has been somewhat neglected in the literature so far.

In this paper we propose a bootstrap resampling scheme to construct JPRs that contain missing paths of a time series of interest with nominal coverage level \(1-\alpha \), along the same approach used by Wolf et al. (2015) and based on the maximum predictive root. Moreover, if the missing value time horizon H is large, the applied researcher may deem the criterion that all elements of the missing path must be contained in the JPR with probability \(1-\alpha \) as too strict. We also consider the more general problem of constructing JPRs that will only contain all elements of missing paths up to a small number \(k-1\), (\(k=1,2,\ldots \)) of them with probability \(1-\alpha \). The choice of k must be made by the applied researcher, with respect to the given problem at hand. But it will be useful to the applied researcher to have a method available that can handle any desired value of k. In particular, the choice \(k = 1\) yields a standard JPR that must contain all elements of a missing path with probability \(1-\alpha \).

A simulation experiment is performed to validate the empirical performances of the proposed JPRs construction method. For the sake of comparison, we also propose two alternative resampling techniques for the construction of JPRs, based on a simple nominal coverage correction loosely inspired by the Bonferroni approach, which are expected to work well in simple and standard scenarios.

The remainder of this paper is organized as follows. Section 2 contains a short review on the spatio-temporal model used and the missing data reconstruction procedure. Section 3 describes our method to construct JPRs for missing paths in spatio-temporal datasets. Section 4 investigates the finite-sample performance via Monte Carlo simulations. An application to real data is also presented in Sect. 5. Finally, some concluding remarks close the paper.

2 Path missing values reconstruction in spatio-temporal datasets

Let \({\mathbf {y}}_t=(y_{t,1},y_{t,2},\ldots ,y_{t,p})\) be a multivariate stationary process of dimension p, assumed for simplicity with zero mean value vector, generating the observations at time t from p different locations. Following Dou et al. (2016), we assume that the process can be modeled by the following Spatial Dynamic Panel Data (SDPD) model

$$\begin{aligned} {\mathbf {y}}_t = D(\varvec{\lambda }_0){\mathbf {W}}{\mathbf {y}}_t + D(\varvec{\lambda }_1){\mathbf {y}}_{t-1} + D({\varvec{\lambda }_2}){\mathbf {W}}{\mathbf {y}}_{t-1} + {\epsilon }_t, \end{aligned}$$
(1)

where \(D(\cdot )\) are diagonal matrices with diagonal coefficients from the vectors \(\varvec{\lambda }_0, \varvec{\lambda }_1\) and \(\varvec{\lambda }_2\), and the error process \({\epsilon }_t\) is serially uncorrelated, with diagonal heteroskedastik variance-covariance matrix \({\varvec{{\varSigma }}}_\epsilon \). Model (1) belongs to the family of spatial econometric models, so it is particularly oriented to model spatio-temporal data. The matrix \({\mathbf {W}}\) is called spatial matrix and collects the weigths used in the spatial regression of each time series observation with simultaneous or delayed observations of neighboring data. In particular, note that the term \(D(\varvec{\lambda }_0){\mathbf {W}}{\mathbf {y}}_t\) captures the pure spatial effects, since it only considers contemporary observations, the component \(D(\varvec{\lambda }_1){\mathbf {y}}_{t-1}\) captures the pure dynamic effects, since it involves lagged observations, while \(D({\varvec{\lambda }_2}){\mathbf {W}}{\mathbf {y}}_{t-1}\) captures the spatial-dynamic effects. However, if one uses a correlation based matrix \({\mathbf {W}}\) to measure variable distances, instead of using physical distances, one can use model (1) to analyse any kind of multivariate time series, not necessarily of strictly spatial nature. Of course, spatial models rely on the spatial weight matrix \({{\mathbf {W}}}\) to specify the cross-sectional correlation. Currently, there is no well-defined theory on how to find the true spatial matrix for a given data application. Recently, there have been some proposals to make this matrix “endogenous” within the model or to select the best spatial matrix among a list of candidates, in order to increase the flexibility of the spatial model (see, for example, Gao et al. 2019; Qu et al. 2021; Zhang et al. 2018). Of course, some additional assumptions and/or some adjustments to the model parametrization are required for identifiability. In this paper, we assume that matrix \({{\mathbf {W}}}\) is known, as usually done within this class of models, but we defer further investigations on the selection of the “best” matrix \({{\mathbf {W}}}\) for the future.

In the following, we assume that \({\mathbf {y}}_1, \ldots , {\mathbf {y}}_T\) are realizations from the stationary process defined by (1). Then, we denote with \(\varvec{{\varSigma }}_j=\mathop {Cov}({\mathbf {y}}_t,{\mathbf {y}}_{t-j})=E({\mathbf {y}}_t{\mathbf {y}}_{t-j}^\prime )\) the autocovariance matrix of the process at lag j, where the prime superscript denotes the transpose operator.

The parameters of model (1) can be estimated following Dou et al. (2016). In particular, given stationarity, from (1) we derive the Yule-Walker equation system

$$\begin{aligned} \left( {\mathbf {I}}-D\left( \varvec{\lambda }_0\right) {\mathbf {W}}\right) \varvec{{\varSigma }}_1=\left( D\left( {\varvec{\lambda }_1}\right) + D\left( \varvec{\lambda }_2\right) {\mathbf {W}}\right) \varvec{{\varSigma }}_0, \end{aligned}$$

where \({\mathbf {I}}\) is the identity matrix of order p. The i-th row of the equation system is

$$\begin{aligned} \left( {\mathbf {e}}_i^{\prime }-\lambda _{0i}{\mathbf {w}}_i^{\prime }\right) \varvec{{\varSigma }}_1=\left( \lambda _{1i}{\mathbf {e}}_i^{\prime }+ \lambda _{2i}{\mathbf {w}}_i^{\prime }\right) \varvec{{\varSigma }}_0, \quad i=1,\ldots ,p, \end{aligned}$$
(2)

with \({\mathbf {w}}_i\) the i-th row vector of \({\mathbf {W}}\) and \({\mathbf {e}}_i\) the i-th unit vector. Replacing \(\varvec{{\varSigma }}_1\) and \(\varvec{{\varSigma }}_0\) by the sample (auto)covariance matrices

$$\begin{aligned} \widehat{\varvec{{\varSigma }}}_1 = \frac{1}{T} \sum _{t=1}^{T-1} {\mathbf {y}}_{t+1} {\mathbf {y}}_t^\prime \quad \mathrm{and} \quad \widehat{\varvec{{\varSigma }}}_0 = \frac{1}{T} \sum _{t=1}^T {\mathbf {y}}_t{\mathbf {y}}_t^\prime , \end{aligned}$$

the vector \((\lambda _{0i},\lambda _{1i},\lambda _{2i})^{\prime }\) is estimated by the generalized Yule-Walker estimator, available in closed form,

$$\begin{aligned} \left( {{\widehat{\lambda }}}_{0i},{{\widehat{\lambda }}}_{1i},{{\widehat{\lambda }}}_{2i}\right) ^{\prime }= \left( \widehat{{\mathbf {X}}}_i^{\prime }\widehat{{\mathbf {X}}}_i\right) ^{-1}\widehat{{\mathbf {X}}}_i^{\prime }\widehat{{\mathbf {y}}}_i, \quad i=1,2,\ldots ,p, \end{aligned}$$
(3)

where \(\widehat{{\mathbf {X}}}_i=\left( \widehat{\varvec{{\varSigma }}}_1^{\prime }{\mathbf {w}}_i, \widehat{\varvec{{\varSigma }}}_0{\mathbf {e}}_i, \widehat{\varvec{{\varSigma }}}_0{\mathbf {w}}_i\right) \) and \(\widehat{{\mathbf {y}}}_i=\widehat{\varvec{{\varSigma }}}_1^{\prime }{\mathbf {e}}_i\).

Model (1), once estimated, can be used to reconstruct sequences of missing values. To this aim, let us assume that \(\widetilde{{\mathbf {y}}}_1, \cdots , \widetilde{{\mathbf {y}}}_T\) are realizations from a stationary spatio-temporal process with mean value not necessarily equal to zero. Let \(\varvec{\delta }_t=(\delta _{t,1},\ldots ,\delta _{t,p})\) be a vector of zeroes/ones that identifies all the missing values in the observed vector \(\widetilde{{\mathbf {y}}}_t\), so that \(\delta _{t,i}=0\) if the observation \({\widetilde{y}}_{t,i}\) is missing, otherwise it is \(\delta _{t,i}=1\). In case of processes with no zero mean, model (1) can still be used for parameter estimation after a pre-processing step which centers the observed time series, i.e. \({{\mathbf {y}}}_t=\widetilde{{\mathbf {y}}}_t-\bar{{\mathbf {y}}}\). Now, because we have missing values in the observed spatio-temporal series, centering data using the initial mean \(\bar{\mathbf{y}}\) (excluding missing values) is like centering using the wrong mean. As a result, we can have a bias in the estimation results (unless we can assume missing observations happen at random, i.e. they occur independently on the level of the process). To prevent this problem, we include the computation of the mean in the iterative procedure, so that mean-centering is made iteratively on the basis of the whole series (including the imputed values). See Parrella et al. (2019) for more details about this issue.

The imputation procedure is described in details in Algorithm 1.

figure a

Note that the procedure is able to reconstruct both isolated missing values and (possibly long) sequences of missing values. In doing so, the procedure uses both the time dependence structure in each time series and the cross-dependency among time series making the procedure very effective, assuming that the spatio-temporal parametric model is correctly assumed. As an additional remark, note that the model estimation procedure is based on an estimator available in closed form making the overall procedure very convenient from a computational point of view. Therefore, our estimation procedure can still be used efficiently when the number of spatial locations is very large. In fact, note that matrix \(\widehat{{\mathbf {X}}}_i^{\prime }\widehat{{\mathbf {X}}}_i\) in Eq. (3) is always of order \(3\times 3\), whatever the total number of spatial locations is. So, the matrix inversion can be done very quickly. Hence, the estimation procedure requires a loop to estimate the spatial parameters for all spatial units (for \(i=1,\ldots ,p\), where p is the total number of units). This loop can be easily parallelized, in case p is extremely large. This latter aspect is of great importance as we use the bootstrap to approximate the confidence bands, which itself requires additional computational burden.

3 Bootstrap joint prediction regions for missing value paths

Given the observed spatio-temporal series \({\mathbf {y}}_1, \cdots , {\mathbf {y}}_T\), we assume there can be several missing values and/or missing sequences appearing here and there in the multivariate series, at different locations and/or different time intervals. Unlike the previous section, where we used a matrix/vector notation to refer to the multivariate series, in this section we prefer to focus on a single univariate missing sequence (or missing value), in order to keep notation simple as we explain how to derive the JPR for this sequence. Of course, the same arguments must be repeated for all the missing values/sequences in the spatio-temporal series.

So, let \(y_{t,i},y_{t+1,i}, \ldots , y_{t+H^i_t-1,i}\) denote a generic sequence of \(H^i_t\) missing values, starting at time t for a given site i. Here and in the following we refer to a missing sequence of length \(H^i_t\), but note that this notation includes the special case of isolated missing values when \(H^i_t=1\).

Let \({\widehat{y}}_{t,i},{\widehat{y}}_{t+1,i}, \ldots , {\widehat{y}}_{t+H^i_t-1,i}\) be the sequence of predicted values using the procedure of the previous section, based on the multivariate data \({\mathbf {y}}_1, {\mathbf {y}}_2, \ldots , {\mathbf {y}}_T\). Since the predictor \({y}_{t+h,i}\) is a function of the data and of the model with parameters \(\varvec{\theta }=(\varvec{\lambda }_{0},\varvec{\lambda }_{1},\varvec{\lambda }_{2})\) we write it as \({\widehat{y}}_{t+h,i}=g_h({\mathbf {y}}_1, {\mathbf {y}}_2, \ldots , {\mathbf {y}}_T,\widehat{\varvec{\theta }})\), with \(h=0,1,\ldots ,H^i_t-1\).

The aim is to construct a joint prediction region (JPR) that contains the missing path of interest with probability \(1-\alpha \), at least asymptotically. Following Wolf et al. (2015), define the family-wise error rate as

$$\begin{aligned} {\text {FWE} = \Pr \!\left( \text {at least one of}\; y_{t+h,i},h=0,\ldots ,H^i_t-1, \text {is not contained in JPR}\right) }. \end{aligned}$$

When the length of the missing sequence \(H^i_t\) is large, control of the FWE may be deemed too strict as it would cause very large (and therefore uninformative) JPRs. In such a case, we suggest to use the generalized family-wise error rate (\(k\text{- }\)FWE) defined as

$$\begin{aligned} {k\text{- }\text {FWE} = \Pr \!\left( \text {at least}\; k \text { of}\; y_{t+h,i}, h=0,\ldots ,H^i_t-1, \text {are not contained in JPR}\right) }, \end{aligned}$$

with \(k<H^i_t\), providing the applied researcher with an alternative tool in order to relax the control of the FWE.

In the next sections, we describe three different methods to derive the JPR for a missing sequence, following the \(k\text{- }\text {FWE}\) criteria: the Maximum Predictive Root method (MPR), the normal Bootstrap method (NB) and the Percentile method (PER). Our main proposal is the MPR method, while the other two methods are included for the sake of comparison, and used as much simpler alternative techniques that are expected to work reasonably well in standard scenarios. However, it is worth pointing out that all three methods have been proposed here for the first time, and all of them have interesting peculiarities and good empirical performance, as will be shown in the simulation study.

3.1 Joint prediction regions with the Maximum Predictive Root method

Let \({y}_{t+h,i}-{\widehat{y}}_{t+h,i}\), \(h=0,1,\ldots , H^i_t-1\), be the sequence of prediction errors. The Maximum Predictive Root (MPR) method is based on the statistic \(M_{k,H^i_t}\) defined as

$$\begin{aligned} M_{k,H^i_t} = k\text{- }\mathrm{{max}}_{h=0,1,\ldots ,H^i_t-1}\!\left( \big |{y}_{t+h,i} - {\widehat{y}}_{t+h,i}\big |\right) , \end{aligned}$$

where, as in Wolf et al. (2015), we define \(k\text{- }\max _{h=0,1,\ldots ,H^i_t-1}(x_{h})\) as the function that returns the k-th largest value of the set \(\{x_{0},\ldots ,x_{H^i_t-1}\}\). Then a two-sided JPR for the missing sequence, that controls the \(k\text{- }\)FWE in finite samples, is given by:

$$\begin{aligned} \left[ \widehat{y}_{t,i} \pm q^{i,t,k}_{(1-\alpha )}\right] \times \left[ \widehat{y}_{t+1,i} \pm q^{i,t,k}_{(1-\alpha )}\right] \times \cdots \times \left[ \widehat{y}_{t+H^i_t-1,i} \pm q^{i,t,k}_{(1-\alpha )}\right] \end{aligned}$$
(7)

where \(q^{i,t,k}_{(1-\alpha )}\) is the \(1-\alpha \) quantile of the distribution of \(M_{k,H^i_t}\). The implication is that the probability that the previous region will contain at least \(H^i_t-k+1\) elements of the missing sequence is equal to (at least) \(1-\alpha \) asymptotically.

Clearly, the JPR defined in Eq. (7) is not useful since the quantile \(q^{i,t,k}_{(1-\alpha )}\) is unknown. However, it can be estimated by using some resampling scheme. Given a bootstrap pseudo series \({\mathbf {y}}^{*}_1, {\mathbf {y}}^{*}_2 \ldots , {\mathbf {y}}^{*}_T\) the bootstrap counterparts of the aforementioned quantities can be defined as \({y}^{*}_{t+h,i}-{\widehat{y}}^{*}_{t+h,i}\) with \({\widehat{y}}^{*}_{t+h,i}=g_h({\mathbf {y}}^{*}_1, {\mathbf {y}}^{*}_2, \ldots , {\mathbf {y}}^{*}_T, \widehat{\varvec{\theta }}^{*})\). Let \({\widehat{q}}^{i,t,k}_{(1-\alpha )}\) be the \(1-\alpha \) quantile of the distribution of the statistic \(M^{*}_{k,H^i_t} = k\text{- }\max _{h=0,1,\ldots ,H^i_t-1}\!(|{y}^*_{t+h,i} - {\widehat{y}}^*_{t+h,i}|)\). A two-sided JPR for \(y_{t,i},y_{t+1,i}, \ldots , y_{t+H^i_t-1,i}\) that controls the \(k\text{- }\)FWE in finite samples is given by:

$$\begin{aligned} \left[ \widehat{y}_{t,i} \pm {\widehat{q}}^{i,t,k}_{(1-\alpha )}\right] \times \left[ \widehat{y}_{t+1,i} \pm {\widehat{q}}^{i,t,k}_{(1-\alpha )}\right] \times \cdots \times \left[ \widehat{y}_{t+H^i_t-1,i} \pm {\widehat{q}}^{i,t,k}_{(1-\alpha )}\right] . \end{aligned}$$
(8)

Details are given in Algorithm 2.

figure b

The JPRs derived by the MPR method have two important advantages: first, they are proven to be asymptotically consistent under a realistic, mild high-level assumption. Second, they enjoy superior finite-sample properties, as demonstrated via extensive Monte Carlo simulations in Wolf et al. (2015). Algorithm 2 assumes a generic bootstrap method to generate the bootstrap sample. In the following we use a resampling procedure based on the residual bootstrap approach. The procedure can be implemented as detailed in Algorithm 3. The theoretical properties of the residual bootstrap scheme for time series can be derived following Choi et al. (2000).

figure c

3.2 Joint prediction regions by combining bootstrap resampling with a generalised Bonferroni technique

In order to present some alternative approaches to derive the JPRs for the missing sequences, in this section we also suggest to build them by stringing together marginal confidence intervals derived by combining a bootstrap resampling technique with a generalised version of the Bonferroni correction. Unlike the MPR method used in the previous section, the Bonferroni method is accused to be a rough method to derive joint confidence regions, since it does not take into account the dependence among the marginal confidence intervals and therefore may lead to more conservative joint confidence regions. However, to take advantage of the simplicity of this approach, in this paper we generalise the idea of the Bonferroni correction by adapting it to the k-FWE criteria, so that we can derive more flexible joint confidence regions that are substantially comparable with those based on the MPR method.

So, if the problem is to guarantee that the JPR will contain at least \(H^i_t-k+1\) elements of the missing sequence (\(1\le k<H^i_t\)) with global probability equal to \(1-\alpha \), asymptotically, then suitable individual marginal sizes \({\widetilde{\alpha }}^{(i,t)}_k\) must be set. Generalizing the idea of the Bonferroni multiple scheme, we set this value equal to

$$\begin{aligned} {\widetilde{\alpha }}_k^{(i,t)}=\max \left\{ a\in (0,1): \mathop {Pr}(A^{H^i_t}_a\le k)\ge 1-\alpha \right\} \;\text{ with }\ A^{H^i_t}_a \sim \mathop {Binom}(H^i_t, a). \end{aligned}$$

Then, the individual confidence intervals with nominal confidence level \((1-{\widetilde{\alpha }}_k^{(i,t)})\) are derived using some bootstrap method, for each h-th value of the missing sequence. In particular, we use the normal bootstrap method (i.e. the normal approximation with bootstrap estimated standard errors) and the percentile boostrap method. The final JPR derived with these two bootstrap procedures are denoted with NB and PER, respectively.

In particular, the percentile (PER) joint bootstrap prediction intervals are given similarly as in (8) in which \({\widehat{q}}^{i,t,k}_{(\alpha )}\) is replaced by \({\widetilde{q}}_{h}( {\widetilde{\alpha }}_k^{(i,t)})\) for \(h=1,\ldots ,H^i_t\), that is the \( {\widetilde{\alpha }}_k^{(i,t)}\)-quantile of the bootstrap distribution of the statistic

$$\begin{aligned} S^{*}_{h}=y^{*}_{t+h,i}-{\widehat{y}}^{*}_{t+h,i}. \end{aligned}$$

Therefore, the JPR derived by the PER method is given by

$$\begin{aligned} \left[ {\widehat{y}}_{t+h,i} + {\widetilde{q}}_{h}({\widetilde{\alpha }}_k^{(i,t)}/2), {\widehat{y}}_{t+h,i} + {\widetilde{q}}_{h}(1-{\widetilde{\alpha }}_k^{(i,t)}/2)\right] . \end{aligned}$$

Finally, the normal bootstrap (NB) joint prediction intervals are derived assuming normality for the error process. For \(h=1,\ldots ,H^i_t\), we have

$$\begin{aligned} \left[ {\widehat{y}}_{t+h,i} - { z}_{1-{\widetilde{\alpha }}_k^{(i,t)}/2}{sd}_B(S^*_h), {\widehat{y}}_{t+h,i} + { z}_{1-{\widetilde{\alpha }}_k^{(i,t)}/2}{sd}_B(S^*_h)\right] , \end{aligned}$$

where \(z_\gamma \) is the \(\gamma \)-th percentile of the standard normal distribution and \({sd}_B(S^*_h)\) is the bootstrap estimated standard deviation for \(S^*_h\).

The procedure for the NB and PER joint prediction regions can be implemented as detailed in Algorithm 2.

4 A Monte Carlo study

To validate the empirical performance of the proposed JPR in Eq. (8), we have implemented a Monte Carlo simulation study. The aim is to compare empirically the coverages and the mean lengths of the regions obtained using the normal-based method, the percentile and the Maximum Predictive Root method.

We have considered multivariate time series of dimension \(p=30\) and lengths \(T=100, 500\) and 1000. The weight matrix \({{\mathbf {W}}}\) has been randomly generated as a full rank symmetric matrix and has been row-normalized. The parameters of model (1) have been randomly generated in the interval \([-0.9, 0.9]\). The error component \({\varvec{\varepsilon }}_t\) has been generated from two different multivariate distributions. The first one is a multivariate normal distribution, with mean vector zero and diagonal variance-covariance matrix, with heteroscedastic variances \((\sigma ^2_1, \ldots ,\sigma ^2_p)\). In particular, the standard deviations \((\sigma _1, \ldots ,\sigma _p)\) have been generated randomly from a Uniform distribution U(0.5; 1.5). The second one is the multivariate distribution in which the marginal distributions are pairwise independent student-t distribution with 6 degrees of freedom. Note that, in the case of normal distributed error term, the consistency of the normal based bootstrap confidence intervals is guaranteed.

In the simulation experiment both isolated missing values and missing sequences have been considered. One missing sequences with length H has been placed at location 2 and for it three different time horizons (missing sequence length) have been considered \(H=5, 10\) and 20. The number of the isolated missing values has been fixed at 10 and they have been randomly generated at other locations.

Table 1 Mean lengths of the \(k-\)JPRs (\(k=1,2,3\)) with the normal-based (NB) method, the percentile (PER) and the Maximum Predictive Root (MPR) method in the case of normal distributed error term

In the experiment, for all the three methods, we have considered the \(k-\)JPR that is the estimated confidence region which contains at least \(H-k+1\) elements. We have fixed \(k=1,2\) and 3.

All bootstrap estimates have been computed by using \(B=999\) replicates and \(N=1000\) Monte Carlo runs.

In Table 1 the mean lengths of the joint \(k-\)JPRs (\(k=1,2,3\)) by means of the normal-based (NB) method, the percentile (PER) and the Maximum Predictive Root (MPR) methods are listed for all the considered values of T and H and for both the confidence levels \(1-\alpha =0.95\) and 0.90. In this case the distribution of the error term is the standard Gaussian.

Table 2 Empirical coverages of the \(k-\)JPR (\(k=1,2,3\)) with the normal-based (NB) method, the percentile (PER) and the Maximum Predictive Root (MPR) method in the case of normal distributed error term

The three methods present similar performance in terms of mean length. As expected, the amplitude of the regions increases as H increases. The results of Table 1 are confirmed from Table 2, in which the empirical coverages are shown for the same choices of Table 1. In order to evaluate if the coverages are significantly different with respect to the nominal level, we have calculated the asymptotic acceptance confidence interval at \(99\%\) . It is defined as \(p_0\pm 2.33\sqrt{\frac{p_0(1-p_0)}{N}}\) where \(p_0\) is fixed to the nominal level and N is the number of Monte Carlo runs. Observed coverages inside such interval can be considered not different from the fixed nominal level. In Table 2 these values are reported in bold.

The three methods present similar empirical coverages, by varying H and by varying \(\alpha \). When \(H=10\), the NB and MPR method have similar performance and generally better than the PER method, for all the values of k and for both the considered confidence levels.

By increasing H, the coverages are worse and many of them are outside the acceptance region, with levels that are far from the nominal coverage. However, the MPR method presents better performances with respect to the other two methods. When k is fixed, the lengths of the regions obtained by the three methods seem to be basically comparable.

Figure 1 refers to the case \(H=5\) (the total number of missing values is 15). The three time series lengths \(T=100, 500\) and \(T=1000\) are considered (left, center and right respectively) and \(k=1, 2\) and 3 (from the top to the bottom). All the considered methods are consistent, since by increasing T and fixing k, the coverage level becomes closer and closer to the nominal one 0.95. The NB and the MPR methods seem to have similar performance. For \(k=1\) the PER method has worse performances with respect to the others since the coverage is lower for all values of T. For \(k=2\) and \(k=3\) also the PER method seems equivalent to the others. By looking at the univariate confidence intervals, the observed coverage for all the three methods is more or less similar. They reach the nominal level for the isolated missing values, already for \(T = 100\) and \(k = 1\).

Figure 2 refers to the case \(H=10\) (the total number of missing values is 20). The results seem to be quite similar to Fig.  1, showing a better performance of the MPR method with respect to the others, expecially for \(k=2\) and \(T=100\). Also here, the PER method presents worse performances, which however improve for \(k=2\) and \(k=3\) and become comparable with the other methods. For the univariate coverages, the estimated intervals for the isolated missing values present a coverage close to the nominal one. For the sequence of missing values, empirical coverages of the univariate intervals must not be compared with the nominal coverage \(1-\alpha =0.95\), which refers to the nominal global coverage of the whole JPR. Therefore, univariate coverages of the marginal intervals for the missing sequence must necessarily be greater than \(1-\alpha \), but can be significantly reduced when \(k>1\).

Fig. 1
figure 1

Empirical coverages of the joint \(k-\)JPRs (\(k=1, 2, 3\)) with the NB (blue dashed line), PER (black dashed line) and MPR (green dashed line) methods in the case of normal distributed error term and in the presence of 15 missing values. The first 5 values, on the left of the vertical line, represent the missing sequence, while the remaining values are isolated missings. The nominal level is fixed to 0.95 (red line). The empirical coverages for the univariate NB intervals are blue “+”; the univariate PER are black “o” and the univariate MPR are green “\({\varDelta }\)” (colour figure online)

Fig. 2
figure 2

Empirical coverages of the joint \(k-\)JPRs (\(k=1, 2, 3\)) with the NB (blue dashed line), PER (black dashed line) and MPR (green dashed line) methods in the case of normal distributed error term and in the presence of 20 missing values. The first 10 values, on the left of the vertical line, represent the missing sequence, while the remaining values are isolated missing values. The nominal level is fixed to 0.95 (red line). The empirical coverages for the univariate NB intervals are blue “+”; the univariate PER are black “o” and the univariate MPR are green “\({\varDelta }\)” (colour figure online)

Figure 3 refers to the case \(H=20\) (the total number of missing values is 30). In this last case the MPR method outperforms the other ones.

Fig. 3
figure 3

Empirical coverages of the joint \(k-\)JPRs (\(k=1,2,3\)) with the NB (blue dashed line), PER (black dashed line) and MPR (green dashed line) methods in the case of normal distributed error term and in the presence of 30 missing values. The first 20 values, on the left of the vertical line, represent the missing sequence, while the remaining values are isolated missing values. The nominal level is fixed to 0.95 (red line). The empirical coverages for the univariate NB intervals are blue “+”; the univariate PER are black “o” and the univariate MPR are green “\({\varDelta }\)” (colour figure online)

In Table 3 the mean lengths of the joint \(k-\)JPRs (\(k=1,2,3\)) when the error term is t(6)-distributed are shown for all the considered values of T and H. Also in this case, the three methods present similar behaviours.

In Table 4 the empirical coverages are reported for the same values of Table 2 when the error term is t-distributed. In general, the NB and PER methods present worse performances, especially for longer missing sequences. Moreover, the PER method seems to suffer more instability for the nominal coverage correction, especially when moving deep into the tails. Apparently, a much greater number of Monte Carlo replicates are needed for an accurate estimate of the bootstrap percentiles used for the construction of the JPRs.

Table 3 Mean lengths of the \(k-\)JPRs (\(k=1,2,3\)) with the normal-based (NB) method, the percentile (PER) and the Maximum Predictive Root (MPR) method in the case of \(t_{(6)}\)-distributed error term
Table 4 Empirical coverages of the \(k-\)JPRs (\(k=1,2,3\)) with the normal-based (NB) method, the percentile (PER) and the Maximum Predictive Root (MPR) method in the case of t(6)-distributed error term

Figure 4 refers to the case \(H=5\) (the total number of missing values is 15). As in the case of normal errors, all the considered methods are consistent and again the MPR method outperforms all the others. However, for \(k=1\) and for all values of T, the NB method has worse performance, while for \(k=2\) and \(k=3\) the three methods seem to be equivalent. By looking at the univariate confidence intervals, the observed coverages for all the three methods are quite similar, both for isolated and sequence of missing values. In particular, in all the cases the coverages for the isolated missing values are close to 0.95 for all T and for all k.

Fig. 4
figure 4

Empirical coverages of the \(k-\)JPRs (\(k=1,2,3\)) with the NB (blue dashed line), PER (black dashed line) and MPR (green dashed line) methods in the case of \(t_{(6)}-\)distributed error term and in the presence of 15 missing values. The first 5 values, on the left of the vertical line, represent the missing sequence, while the remaining values are isolated missing values. The nominal level is fixed to 0.95 (red line). The empirical coverages for the univariate NB intervals are blue “+”; the univariate PER are black “o” and the univariate MPR are green “\({\varDelta }\)” (colour figure online)

Figure 5 refers to the case \(H=10\) (the total number of missing values is 20) with \(t_{(6)}-\)student errors. The results seem to be quite similar to Fig. 4, showing even more clearly a better performance of the MPR method with respect to the others, expecially for \(k=1\) and \(T=100, 500\). Also here, the NB method presents worse performances for \(k=1\), which however improve for \(k=2\) and \(k=3\). Again, for the univariate case, the estimated intervals for the isolated missing values present a coverage close to the nominal one.

Fig. 5
figure 5

Empirical coverages of the \(k-\)JPRs (\(k=1,2,3\)) with the NB (blue dashed line), PER (black dashed line) and MPR (green dashed line) methods in the case of \(t_{(6)}-\)distributed error term and in the presence of 20 missing values. The first 10 values, on the left of the vertical line, represent the missing sequence, while the remaining values are isolated missing. The nominal level is fixed to 0.95 (red line). The empirical coverages for the univariate NB intervals are blue “+”; the univariate PER are black “o” and the univariate MPR are green “\({\varDelta }\)” (colour figure online)

Figure 6 refers to \(H=20\) (the total number of missing values is 30). In this last case the MPR method sharply outperforms the other ones mainly for the time series length \(T=100\) and 500. For the univariate case, as in the previous analysis, the estimated intervals for the isolated missing values present a coverage close to the nominal one.

Fig. 6
figure 6

Empirical coverages of the \(k-\)JPRs (\(k=1,2,3\)) with the NB (blue dashed line), PER (black dashed line) and MPR (green dashed line) methods in the case of \(t_{(6)}-\)distributed error term and in the presence of 30 missing values. The first 20 values, on the left of the vertical line, represent the missing sequence, while the remaining values are isolated missing values. The nominal level is fixed to 0.95 (red line). The empirical coverages for the univariate NB intervals are blue “+”; the univariate PER are black “o” and the univariate MPR are green “\({\varDelta }\)” (colour figure online)

5 An application to real data

In order to evaluate how the proposed procedure works on real data, we consider daily \(PM_{10}\) data (in \(\mu g/m^3\)) from 1 January 2015 to 19 October 2016 (658 days) at 24 sites in Piemonte (see http://www.arpa.piemonte.gov.it). The particulate matter \(PM_{10}\) emissions are regulated by the EU which has set two limit values for the protection of human health: the daily mean value may not exceed \(50\, \upmu \mathrm{g/m^3}\) more than 35 times in a year and the annual mean value may not exceed \(40 \mu g / m^3\). In the considered dataset, isolated missing values as well as missing sequences are present due to the defaults in the monitoring stations. Their point reconstruction has been addressed in Parrella et al. (2019) by using model (1). The spatial matrix \({{\mathbf {W}}}\) has been set as the normalized geographical distances matrix of \({{\mathbf {y}}}_t\), i.e. its entries are \(w_{ij}/\sum _{k=1}^p w_{kj}\), where \(w_{ij}=1/(1+d_{ij})\), \(d_{ij}\) is the geographical distance between the i-th and j-th stations for \(i \ne j\), and \(w_{ii} = 0\) for \(i=1,\ldots ,p\).

Here, for sake of illustration, we analyse the data from Novara-Verdi station which presents the \(14.74\%\) of missing data. In particular there is one isolated missing value on 12 January 2015 and then a sequence of missing values 42 observations long, starting from 11 March 2015. Figures 7 and 8 report the observed time series, along with the imputed values and the \(k-\)JPRs (\(k=1,2,3\)) for the missing values obtained with the MPR and the NB methods respectively, in the time interval up to 30/04/2015. The confidence level is fixed at \(90\%\) and the red horizontal line is the EU limit \(50\, \upmu \mathrm{g/m^3}\) .

It is evident that, in both the cases, the reconstruction procedure for the missing values provides values that are quite often under the EU threshold. However, by looking at \(k-\)JPRs obtained with the MPR method in 7, their upper bounds is above 50 almost everywhere expecially for \(k=1\). Also for the isolated missing value, the univariate confidence interval includes the EU limit. Moreover, as expected, by increasing k, the lengths of the \(k-\)JPRs decrease.

When using the NB method (Fig. 8), by varying k, lengths are closer to each other, with respect to MPR case.

Fig. 7
figure 7

\(PM_{10}\) data from Novara-Verdi station in the interval 01/01/2015–30/04/2015 (black line). The gray circles represent the imputed values for the missing values. The coloured lines are the \(k-\)JPRs, for \(k=1\) (green), 2 (blue) and 3 (magenta), calculated by using MPR method. Red horizontal line is the EU limit, i.e. \(50\, \upmu \mathrm{g/m^3}\) (colour figure online)

Fig. 8
figure 8

\(PM_{10}\) data from Novara-Verdi station in the interval 01/01/2015–30/04/2015 (black line). The gray circles represent the imputed values for the missing values. The coloured lines are the \(k-\)JPRs, for \(k=1\) (green), 2 (blue) and 3 (magenta), calculated by using NB method. Red horizontal line is the EU limit, i.e. \(50\, \upmu \mathrm{g/m^3}\) (colour figure online)

6 Concluding remarks

In this paper we deal with the construction of joint prediction intervals for missing data patterns in the context of spatio-temporal data. We have proposed a bootstrap resampling scheme able to provide joint prediction regions that approximately contain missing paths of a time series of interest, with probability \(1-\alpha \). The approach is based on maximum predictive roots proposed in Pan (2016) and Wolf et al. (2015). We have also considered JPRs that only contain all elements of missing paths up to a small number \(k-1\) of them with probability \(1-\alpha \), where the choice of k can be made with respect to the problem at hand.

A simulation experiment has been performed to validate the empirical performance of the proposed method based on the maximum of the predictive root statistic and to compare it with two simpler alternatives. In particular we have considered multivariate time series generated by a SDPD model (see Dou et al. 2016) with \(p=30\) and different sample sizes \(T=100, 500\) and 1000 and two distributions for the error terms, normal and student-t. In the experiment we have analysed the mean lengths of the obtained JPRs along with their empirical coverages both in the cases of isolated missing values and for missing sequences of different lengths. The simulation experiment has shown that generally the procedure based on the maximum predictive root delivers better and more stable performances for non-gaussian error terms and longer missing sequences, with similar mean lengths for the interpolation regions.

The application on real data shows that the reconstruction procedure for the missing values provides values that are in most of the cases under the EU threshold. However, the \(k-\)JPRs obtained with the MPR method produce intervals which quite often contain the EU limit of \(50 \,\upmu \mathrm{g/m}^3\), showing the importance of the additional information delivered by prediction intervals with respect to single predictions.