1 Introduction

Cliff–Ord (Cliff and Ord 1969) spatial models are based on the implicit assumption that the information about the spatial location of statistical units is accurate. Whilst this circumstance is the norm in the context of areal data (such as municipalities, counties or regions), it is rarely met when the observations are points in space (such as firms, houses or facilities), whose locations may be either missing or affected by locational errors (see Zimmerman 2008; Zimmerman and Li 2010; Arbia et al. 2019b).

Although geolocation may fail for some units because of technical reasons, incomplete positioning arises more frequently in geocoding processes, especially in those circumstances where coordinates of units are obtained by matching postal addresses with georeferenced street maps (see e.g. Kravets and Hadden 2007). Clearly, the quality of the resulting geolocation depends both on the correctness and completeness of postal addresses, as well as on the effectiveness of matching algorithms and software, nonetheless, if position of some units is uncertain, this fact should be properly considered in the estimation process.

When an incomplete address is geocoded, unit position is conventionally imputed to the centroid of the area where unit is located, as it can be known from address information. Such areas may be counties, municipalities, or, more frequently, ZIP code areas (Zimmerman 2008). From a statistical point of view, the presence of locational errors due to coarsened locations may have a significant impact on parameter estimates of spatial models based on the Cliff–Ord approach (Cliff and Ord 1969), as positional errors lead to downward biased estimates for the spatial autoregressive parameters and inconsistent estimates for covariates coefficients (Arbia et al. 2016).

This paper tackles the problem of estimating spatial models where part of units is affected by coarsening. In particular, we focus on the spatial autoregressive model (see e.g. Cressie 2015, ch. 6). The proposed estimation strategy models both the spatial stochastic process and the coarsening mechanism by means of a marked point process whose intensity function is estimated according to the coarsened-data estimator proposed by Zimmerman (2008). The model is then fitted through the maximisation of a doubly-marginalised likelihood function of the marked point process, which cleans out the effects of coarsening.

The first marginalisation of the likelihood function allows the dimensionality of the model to be consistently reduced to non-coarsened points, and it is derived analytically. The second marginalisation is performed via Monte Carlo simulations over the locations of coarsened points.

The modelling approach and Monte Carlo experiments presented in the paper show the validity of the proposed estimation method compared to other estimation approaches. In particular, the comparison concerns the parameter estimates and the direct and indirect effects of model covariates on the dependent variable (Arbia et al. 2019a).

The paper is organised as follows. Section 2 describes the modelling approach and the notation adopted. Section 3 illustrates and discusses the proposed estimation approach. Section 4 illustrates the results of Monte Carlo simulations where the finite properties of parameter estimators and estimators of direct and indirect impacts of regressors are studied. In Sect. 5 the method is applied to spatial hedonic models for house prices in Beijing. Section 6 concludes the paper.

2 Modelling approach and notation

Consider n units \(i=1, \ldots ,n\) for which a quantitative characteristic of interest \(y_i\in {{\,\mathrm{\mathbb {R}}\,}}\) and k regressors \(x_i\in {{\,\mathrm{\mathbb {R}}\,}}^k\) are known. Assume that postal addresses are available for all n units, however only \(p<n\) of them are complete, whereas \(n-p\) are incomplete. Assume also, that the \(n-p\) units with incomplete addresses can be assigned to, say, the ZIP areas they actually belong to.

Under these conditions, if a spatial Cliff–Ord model is used for modelling y (a thorough illustration of the reasons why a spatial modelling approach may be necessary is available e.g. in LeSage and Pace 2009, ch. 2), the coarsening of the \(n-p\) unit locations only affects the specification of the spatial weight matrix, as \(y_i\) and \(x_i\) are known for all units \(i=1, \ldots ,n\).

Consider, for example, the following isotropic spatial autoregressive model (SAR):

$$\begin{aligned} {\left\{ \begin{array}{ll} y=\rho Wy+X\beta +\varepsilon \\ \varepsilon \sim {{\,\mathrm{\mathcal {N}}\,}}_n(0,\sigma ^2I_n) \end{array}\right. } \end{aligned}$$
(1)

where \(X\in {{\,\mathrm{\mathbb {R}}\,}}^{n\times k}\) is the design matrix which includes k regressors, and \(W\in {{\,\mathrm{\mathbb {R}}\,}}^{n\times n}\) is the usual zero-diagonal spatial weight matrix whose elements \(w_{ij}\) take positive values according to some proximity criterion, whereas equal zero if units i and j are not considered as neighbours.

It can be verified that, if p/n is the proportion of non-coarsened units, the share of elements of W not affected by coarsening is only about \((p/n)^2\), whereas all elements change if W is stochastic (that is, if W is row-standardised).

Although the magnitude of the effects of coarsening on the spatial weight matrix is the cause of bias of estimators of the autoregressive parameter \(\rho \) (Arbia et al. 2016), it is worth stressing that the bias of the estimators of \(\rho \) is not just originated from perturbations in the values of weights within neighbourhoods, but it is mainly ascribable to modifications in the neighbourhood relations amongst units, as shown in Santi et al. (2020).

On the other hand, the biasedness of the estimator of the autoregressive parameter \(\rho \) in model (1) gives rise to biasedness of estimators of the other parameters too. It can be proved (see Appendix A.1) that

$$\begin{aligned} {{\,\mathrm{\mathbb {E}}\,}}(\hat{\beta }-\beta \vert \hat{\rho }) =(X^{\mathrm{T}}X)^{-1}X^{\mathrm{T}}\left( \sum _{j=1}^\infty \rho ^jW^j\right) X\beta \,\frac{\hat{\rho }-\rho }{\rho } , \end{aligned}$$
(2a)

so that the bias on \(\beta _j\) (for any \(j=1, \ldots ,k\)) gets larger as the relative bias of \(\hat{\rho }\) grows, provided that \(\beta _j\ne 0\); whereas the conditional bias of \(\hat{\sigma }^2\) has the form (see Appendix A.2):

$$\begin{aligned} {{\,\mathrm{\mathbb {E}}\,}}(\hat{\sigma }^2-\sigma ^2\vert \hat{\rho })&=-\frac{k}{n}\,\sigma ^2+\left[ \frac{(Q_\rho X\beta )^{\mathrm{T}}(Q_\rho X\beta )}{n} +\sigma ^2\,\frac{{{\,\mathrm{tr}\,}}(Q_\rho ^{\mathrm{T}}Q_\rho )}{n} \right] \,(\hat{\rho }-\rho )^2\nonumber \\&\quad -2\sigma ^2\,\frac{{{\,\mathrm{tr}\,}}(Q_\rho )}{n}\,(\hat{\rho }-\rho ) , \end{aligned}$$
(2b)

where \(Q_\rho {{\,\mathrm{\equiv }\,}}(I_n-X(X^{\mathrm{T}}X)^{-1}X^{\mathrm{T}})W(I_n-\rho W)^{-1}\). In this case, unlike estimator \(\hat{\beta }\), the maximum likelihood estimator \(\hat{\sigma }^2\) is biased even if \(\hat{\rho }\) is unbiased, because of the second term on the right-hand side of Eq. (2b).

Equations (2) show that biasedness of estimator \(\hat{\rho }\) reverberates on estimates of other parameters, and thus on derived quantities such as covariate impacts (see Sect. 3.2), confidence intervals, and prediction intervals. This motivates the adoption of estimation methods aiming at reducing the bias of \(\hat{\rho }\) originated from coarsening of unit locations.

The estimation method proposed in this paper basically reduces the dimensionality of the model by concentrating the likelihood on the p non-coarsened units, thus limiting the effects of the coarsened locations on model estimates and, at the same time, exploiting the available information about covariates and zone-based location of the coarsened units.

The problem is modelled as a marked point process where both the stochastic spatial process and the coarsening process are specified conditionally on the underlying point process.

Let \((\varOmega , \mathcal {F}, \mathbb {P})\) be a probability space, and let \(Z\in {{\,\mathrm{\mathbb {R}}\,}}^{n\times 2}\) be a realisation of n points from a 2-dimensional point process \(\{Z(s,\omega ):s\in S\}\) defined over a bounded metric space \((S,\Vert \cdot \Vert )\), where \(S\subset {{\,\mathrm{\mathbb {R}}\,}}^2\). Let \(\lambda :S\rightarrow {{\,\mathrm{\mathbb {R}}\,}}^+\) be the intensity function of \(\{Z(s,\omega ):s\in S\}\) defined as:

$$\begin{aligned} \lambda (s)=\lim _{|\text {d}s|\rightarrow 0}\frac{{{\,\mathrm{\mathbb {E}}\,}}(N(s,\text {d}s))}{\text {d}s} , \end{aligned}$$

in which \(N(s,\text {d}s)\) is the count function for points in the neighbour \(\text {d}s\subset S\) centred in \(s\in S\) (see e.g. Illian et al. 2008).

Conditionally on Z, the isotropic SAR (1) is defined for the spatial process y, where the spatial weight matrix W is row-standardised, and its elements \(w_{ij}\) are defined as follows:

$$\begin{aligned} w_{ij}= {\left\{ \begin{array}{ll} \frac{\kappa (\Vert z_i-z_j\Vert )}{\sum _{v=1}^n\kappa (\Vert z_i-z_v\Vert )} &{} \text {if }i\ne j\text { and }\sum _{v=1}^n\kappa (\Vert z_i-z_v\Vert )\ne 0\\ 0 &{} \text {otherwise}\\ \end{array}\right. } , \end{aligned}$$
(3)

for any \(i,j\in \{1, \ldots ,n\}\), and some non-increasing function \(\kappa :{{\,\mathrm{\mathbb {R}}\,}}^+_0\rightarrow {{\,\mathrm{\mathbb {R}}\,}}^+_0\) such that \(\lim _{x\rightarrow \infty }\kappa (x)=0\). Common choices for \(\kappa \) are \(\kappa (d)=\alpha /d\), \(\kappa (d)=\alpha /d^2\), \(\kappa (d)=\mathrm {e}^{-\alpha d}\), \(\kappa (d)=\mathrm {e}^{-\alpha d^2}\), \(\alpha \) being some positive constant. Often, a cut-off distance \(\bar{d}\) is also specified, so that \(\kappa (d)=\mathbb {1}_{\{d\le \bar{d}\}}\alpha /d^2\). — See Anselin (1988) and Anselin (2002) for a discussion on the alternative specifications of spatial weight matrices (and thus of the decay function \(\kappa \)); whereas LeSage and Pace (2014) and Santi et al. (2020) analyse the effects of spatial weight matrix misspecification.

The coarsening process can be either dependent on the intensity function \(\lambda \) and the realisation of the point process \(\{Z(s,\omega ):s\in S\}\) or independent from them. Here we just assume that the coarsening is modelled by means of a random vector \(\varPhi \), which is a realisation of n Bernoulli random variables independent from the spatial process y conditionally on the point process Z. Thus, the components \(\varPhi _i\) of the random vector \(\varPhi \) are defined as follows:

$$\begin{aligned} \varPhi _i\sim \mathcal {B}(p_i) \end{aligned}$$
(4)

for \(i=1, \ldots ,n\), and take value \(\varPhi _i=0\) if point i is coarsened, whereas \(\varPhi _i=1\) if point i has been correctly geocoded.

Finally, let \(\mathcal {S}=\{S_1,S_2, \ldots ,S_R\}\) be a partition of the space S into R regions such that, for any unit i with coordinate \(z_i\in S\), it exists one region \(S_r\) such that \(z_i\in S_r\).Footnote 1 It is assumed that, for each coarsened unit i, the region \(S_r\) where i is located is known.

To sum up, for all units \(i=1, \ldots ,n\) the values of the dependent variable \(y_i\) and the covariates \(x_i\) are known. For non-coarsened units \(i=1, \ldots ,p\) the coordinates \(z_i\in S\) are known, whereas it is known the coarsening area \(S_r\) of each coarsened unit \(i=p+1, \ldots ,n\) such that \(z_i\in S_r\). Other missing or unknown information such as the values of parameters and the coordinates of coarsened units about model (1) should be either learnt (through estimation) or made it non-relevant (through marginalisation).

Before illustrating our proposal for tackling the estimation problem, we introduce the notation that will be used throughout the rest of the paper.

We denote by subscript P and subscript C non-coarsened and coarsened points respectively (that is, points where \(\varPhi _i=1\) and \(\varPhi _i=0\) respectively). Conditionally on the random vector \(\varPhi \), SAR (1) can be restated as it follows:

$$\begin{aligned} \begin{bmatrix} y_P \\ y_C \end{bmatrix} = \rho \begin{bmatrix} W_{PP} &{} W_{PC} \\ W_{CP} &{} W_{CC} \end{bmatrix} \cdot \begin{bmatrix} y_P \\ y_C \end{bmatrix}+ \begin{bmatrix} X_P \\ X_C \end{bmatrix} \beta + \begin{bmatrix} \varepsilon _P \\ \varepsilon _C \end{bmatrix} \end{aligned}$$
(5)

provided the original SAR is properly permuted by means of a suitable permutation matrix \(P_\varPhi \in \{0,1\}^{n\times n}\), that is:

$$\begin{aligned}&\begin{bmatrix} y_P \\ y_C \end{bmatrix} =P_\varPhi y ,&\begin{bmatrix} W_{PP} &{} W_{PC} \\ W_{CP} &{} W_{CC} \end{bmatrix} =P_\varPhi WP_\varPhi ,&\begin{bmatrix} X_P \\ X_C \end{bmatrix} =P_\varPhi X ,&\begin{bmatrix} \varepsilon _P \\ \varepsilon _C \end{bmatrix} =P_\varPhi \varepsilon . \end{aligned}$$
(6)

Restatement (5) permits observations about coarsened (C) and non-coarsened (P) points to be organised in block matrices.

We also define matrix \(A{{\,\mathrm{\equiv }\,}}I_n-\rho P_\varPhi WP_\varPhi \in {{\,\mathrm{\mathbb {R}}\,}}^{n\times n}\), so that:

$$\begin{aligned} A= \begin{bmatrix} A_{PP} &{} A_{PC} \\ A_{CP} &{} A_{CC} \end{bmatrix}= \begin{bmatrix} I_p-\rho W_{PP} &{} -\rho W_{PC} \\ -\rho W_{CP} &{} I_{n-p}-\rho W_{CC} \end{bmatrix} . \end{aligned}$$
(7)

Finally, it can be proved (see e.g. Lu and Shou 2002) that the following relations hold for the inverse matrix \(A^{-1}\):

$$\begin{aligned} A^{-1}= \begin{bmatrix} A_{PP}^{-1}+A_{PP}^{-1}A_{PC}\tilde{\varXi }^{-1}A_{CP}A_{PP}^{-1}&{} \quad -A_{PP}^{-1}A_{PC}\tilde{\varXi }^{-1}\\ -\tilde{\varXi }^{-1}A_{CP}A_{PP}^{-1}&{} \quad \tilde{\varXi }^{-1}\end{bmatrix} \end{aligned}$$
(8)

where \(\tilde{\varXi }{{\,\mathrm{\equiv }\,}}A_{CC}-A_{CP}A_{PP}^{-1}A_{PC}\) is the Schur complement of \(A_{PP}\) and

$$\begin{aligned} A^{-1}&{{\,\mathrm{\equiv }\,}}\begin{bmatrix} (A^{-1})_{PP} &{} (A^{-1})_{PC} \\ (A^{-1})_{CP} &{} (A^{-1})_{CC} \end{bmatrix}\nonumber \\&= \begin{bmatrix} \varXi ^{-1}&{} \quad -\varXi ^{-1}A_{PC}A_{CC}^{-1}\\ -A_{CC}^{-1}A_{CP}\varXi ^{-1}&{} \quad A_{CC}^{-1}+A_{CC}^{-1}A_{CP}\varXi ^{-1}A_{PC}A_{CC}^{-1}\end{bmatrix} \end{aligned}$$
(9)

where \(\varXi {{\,\mathrm{\equiv }\,}}A_{PP}-A_{PC}A_{CC}^{-1}A_{CP}\) is the Schur complement of \(A_{CC}\) (see e.g. Horn and Johnson 2013).

3 Estimation strategy

3.1 Model fitting

The reduced form of model (5) based on inversion (8) permits the following equation to be derived:Footnote 2

$$\begin{aligned} y_P =\rho W_{PP}y_P+X_P\beta +\varepsilon _P +A_{PC}\tilde{\varXi }^{-1}\left[ A_{CP}A_{PP}^{-1}\,(X_P\beta +\varepsilon _P) -(X_C\beta +\varepsilon _C)\right] .\nonumber \\ \end{aligned}$$
(10)

Left-hand side term of Eq. (10) together with the first three terms of the right-hand side perfectly describe a SAR amongst correctly geo-referenced points, sharing the same parameters of the complete model (1). Unfortunately, the last term on the right-hand side makes things more complicated.

The fourth term on the right-hand side of Eq. (10) proves that, in general, any subset of observations of a SAR does not follow a SAR. Indeed, it makes the estimation process of a SAR with coarsened points particularly tricky, since Eq. (10) includes blocks of matrix A which depend on the (unknown) coordinates of the coarsened points.

As previously stated, the estimation strategy proposed in this paper relies on a double marginalisation of the likelihood function of the SAR (1). In particular, the former marginalisation should be made with respect to \(y_P\), thus concentrating the information about coarsened points into a lower dimensional space. A similar approach to the marginalisation of the SAR has already proved to be successful in the context of variance estimation in 2-dimensional systematic sampling (see Espa et al. 2017). The latter marginalisation should instead be made with respect to the point process of non-coarsened points \(Z_P\), so as to include direct and indirect effects of positional errors in the (marginal) probability distribution of \(y_P\).

The first marginalisation can be derived in closed form from the reduced form of model (1) based on inversion (9), and equals

$$\begin{aligned} y_P =\varXi ^{-1}X_P\beta +\varXi ^{-1}\varepsilon _P -\varXi ^{-1}A_{PC}A_{CC}^{-1}(X_C\beta +\varepsilon _C) , \end{aligned}$$

which implies that:

$$\begin{aligned} {{\,\mathrm{\mathbb {E}}\,}}(y_P|Z,\varPhi )&=\varXi ^{-1}X_P\beta +\rho \,\varXi ^{-1}W_{PC}A_{CC}^{-1}X_C\beta , \end{aligned}$$
(11a)
$$\begin{aligned} {{\,\mathrm{cov}\,}}(y_P|Z,\varPhi )&=\sigma ^2\,\varXi ^{-1}(I_p+\rho ^2\,W_{PC}(A_{CC}^{\mathrm{T}}A_{CC})^{-1}W_{PC}^{\mathrm{T}})(\varXi ^{-1})^{\mathrm{T}}, \end{aligned}$$
(11b)

so that the log-likelihood function \(\ln {{\,\mathrm{\mathcal {L}}\,}}(\rho ,\beta ,\sigma ^2|y,X,Z,\varPhi )\) of the of the model (1) marginalised with respect to \(y_P\) equals:

$$\begin{aligned}&\ln {{\,\mathrm{\mathcal {L}}\,}}(\rho ,\beta ,\sigma ^2|y_P,X,Z,\varPhi ) =-\frac{p}{2}\,\ln (2\pi \sigma ^2)+\ln |\varXi |\nonumber \\&\quad -\frac{1}{2}\,\ln |I_p+\rho ^2\,W_{PC}(A_{CC}^{\mathrm{T}}A_{CC})^{-1}W_{PC}^{\mathrm{T}}|\nonumber \\&\quad -\frac{1}{2\sigma ^2}\,(\varXi y_P-X\beta -\rho W_{PC}A_{CC}^{-1}X_C\beta )^{\mathrm{T}}\cdot \nonumber \\&\quad \cdot (I_p+\rho ^2\,W_{PC}(A_{CC}^{\mathrm{T}}A_{CC})^{-1}W_{PC}^{\mathrm{T}})^{-1}\cdot \nonumber \\&\quad \cdot (\varXi y_P-X\beta -\rho W_{PC}A_{CC}^{-1}X_C\beta ) . \end{aligned}$$
(12)

The second marginalisation requires the intensity function \(\lambda \) to be estimated, so as to characterise the first-order properties of the spatial point process \(\{Z(s,\omega ):s\in S\}\) and, in turn, the probabilistic law of the spatial weight matrix W under coarsened geocoding.

The point process \(\{Z(s,\omega ):s\in S\}\) along with the coarsening process \(\{\varPhi _i:i=1, \ldots ,n\}\) defines a bivariate point pattern (Illian et al. 2008).Footnote 3 According to Zimmerman (2008), for any \(s\in S\), the intensity function \(\lambda :S\rightarrow {{\,\mathrm{\mathbb {R}}\,}}^+\) of the spatial point pattern affected by incomplete geocoding can be estimated as follows:

$$\begin{aligned} \hat{\lambda }(s)=\sum _{i=1}^n[\hat{\phi }(z_i)]^{-1}K_h(s-z_i), \end{aligned}$$
(13)

where \(K_h\) is some kernel function with bandwidth h, \(z_i\) is the observed location of unit i, and \(\hat{\phi }\) is an estimate of the geocoding propensity function \(\phi :S\rightarrow (0,1]\), whose reciprocal (\(1/\hat{\phi }\)) is used as the weighting criterion of the kernel estimator.

As for \(K_h\), Zimmerman (2008) uses a Gaussian kernel whose bandwidth is automatically selected by minimising the mean-square error statistic defined in Diggle (1985) through cross-validation (Berman and Diggle 1989).

In operative terms, Zimmerman (2008) estimates the intensity function \(\lambda \) through the R (R Core Team 2020) function density.ppp of package spatstat (Baddeley et al. 2015), whereas the bandwidth is computed by means of the function bw.diggle (of package spatstat as well). Monte Carlo simulations illustrated in Sect. 4 and the application to real data discussed in Sect. 5 use the same functions.

The geocoding propensity function \(\phi \) can be estimated in various ways, according to the available information about the coarsening process. In this paper, the values of the coarsening probabilities in (4) are assumed to be such that \(p_i=\phi (z_i)\), given the coordinate \(z_i\in S\) of the unit i. It follows that:

$$\begin{aligned} \hat{\phi }(s)= \frac{\sum _{r=1}^R\sum _{i=1}^n\varPhi _i\mathbb {1}_{\{z_i\in S_r\}}\mathbb {1}_{\{s\in S_r\}}}{\sum _{r=1}^R\sum _{i=1}^n\mathbb {1}_{\{z_i\in S_r\}}\mathbb {1}_{\{s\in S_r\}}} , \end{aligned}$$
(14)

so that \(\hat{\phi }\) is constant over each region \(S_r\in \mathcal {S}\), and equals the proportion of non-coarsened points in \(S_r\).

The solution we propose in this paper consists in five steps which are summarised in Algorithm 1.

Algorithm 1

(Double-marginalisation estimation)

  1. 1.

    the geocoding propensity function \(\phi \) is estimated over \(\mathcal {S}\) through estimator (14);

  2. 2.

    the intensity function \(\lambda \) of the coarsened point process Z is estimated according to Zimmerman (2008) through estimator (13);

  3. 3.

    the likelihood of SAR (1) marginalised with respect to \(y_P\) is derived from (11); we denote that likelihood function by \({{\,\mathrm{\mathcal {L}}\,}}(\rho ,\beta ,\sigma ^2|y_P,X,Z,\varPhi )\);

  4. 4.

    the likelihood \({{\,\mathrm{\mathcal {L}}\,}}(\rho ,\beta ,\sigma ^2|y_P,X,Z,\varPhi )\) is marginalised with respect to \(Z_P\), that is:

    (15)

    where \(\hat{\varrho }:S^{n-p}\rightarrow {{\,\mathrm{\mathbb {R}}\,}}^+\) is the conditional probability density function of \(Z_C|Z_P\) implied by the estimated intensity function \(\hat{\lambda }\);

  5. 5.

    marginal likelihood \({{\,\mathrm{\mathcal {L}}\,}}(\rho ,\beta ,\sigma ^2|y,X,Z_P,\varPhi )\) is maximised with respect to \(\rho \), \(\beta \) and \(\sigma ^2\).

As anticipated, marginalisation (15) has to be performed numerically since it seems impossible to compute it analytically. Anyway, two issues may make the outlined method computationally unfeasible.

Firstly, the high-dimensional integration space in (15) may substantially deteriorate the performances of Monte Carlo integration methods.

Secondly, the need to evaluate integral (15) at every step of the optimisation procedure dramatically exacerbates the problem outlined in the previous point.

In order to overcome both problems (and the second in particular), we rely on the cross-entropy algorithm for the optimisation of noisy functions (Rubinstein and Kroese 2004). Unlike other numerical optimisation methods such as the Expectation-Maximisation algorithm (Dempster et al. 1977; Robert and Casella 2004), at each iteration the cross-entropy algorithm simultaneously performs the marginalisation and the optimisation of the likelihood function \({{\,\mathrm{\mathcal {L}}\,}}(\rho ,\beta ,\sigma ^2|y,X,Z_P,Z_C,\varPhi )\). This leads to a substantial reduction of the computational burden required by the optimisation routine.

Results of Monte Carlo simulations discussed in the next section have been performed adopting the same parameters and instrumental distributions of the cross-entropy algorithm as described in Bee et al. (2017), where the method has been applied to maximum likelihood estimation of generalised linear multilevel models (the only exception is in the number N of draws, as it will be clarified later).

3.2 Impact estimators

According to LeSage and Pace (2009), the effects of covariates on the dependent variable of a SAR do not solely depend on regression coefficients \(\beta \), as the spatially-lagged dependent variable induces an indirect effect resulting from the autoregressive parameter \(\rho \) and the spatial weight matrix W. It follows that the overall impact of a regressor on the value of the dependent variable can be decomposed in a direct and an indirect impact, which, however, it is not constant amongst all units. For these reasons, averages of total (\(T(\beta )\)), direct (\(D(\beta \))), and indirect (\(M(\beta )\)) impacts are usually computed (LeSage and Pace 2009):

$$\begin{aligned} T(\beta )&=n^{-1}\,\iota _n^{\mathrm{T}}(I-\rho W)^{-1}\iota _n\beta , \end{aligned}$$
(16a)
$$\begin{aligned} D(\beta )&=n^{-1}\,{{\,\mathrm{tr}\,}}(I-\rho W)^{-1}\beta , \end{aligned}$$
(16b)
$$\begin{aligned} M(\beta )&=T(\beta )-D(\beta ) . \end{aligned}$$
(16c)

According to the model we have described in Sect. 2, some elements of the spatial weight matrix W are uncertain when geocoding is not complete. It follows that impacts should be estimated via Monte Carlo simulations, where the weight matrices are defined according to the realisations of a point process Z with estimated intensity function \(\hat{\lambda }\). Thus, the Monte Carlo estimators of the impact measures (16) can be defined as follows:

$$\begin{aligned}&\widehat{(A^{-1})}=\frac{1}{N}\sum _{k=1}^N(I-\hat{\rho } W_k)^{-1} ,&\hat{T}(\hat{\beta })=n^{-1}\,\iota _n^{\mathrm{T}}\widehat{(A^{-1})}\iota _n\hat{\beta } ,\\&\hat{D}(\hat{\beta })=n^{-1}\,{{\,\mathrm{tr}\,}}\widehat{(A^{-1})}\hat{\beta } ,&\hat{M}(\hat{\beta })=\hat{T}(\hat{\beta })-\hat{D}(\hat{\beta }) . \end{aligned}$$

Since Monte Carlo estimation of matrix \(\widehat{(A^{-1})}\) may be computationally demanding, because of the inversions of the weight matrices \(W_k\), a truncated geometric series of \((I-\hat{\rho } W_k)^{-1}\) may reduce substantially the computational burden of the simulation:

$$\begin{aligned} \widehat{(A^{-1})}=\frac{1}{N}\sum _{k=1}^N\sum _{h=0}^m\hat{\rho }^hW_k^h . \end{aligned}$$

where m represents the truncation order.

3.3 Asymptotics and generalisations

As stated in the introduction, this paper aims at proposing an estimation method for spatial models à la Cliff–Ord (Cliff and Ord 1969), where a portion of data is affected by coarsening, thus the primarily interest is devoted to the parameters of the model, as well as other measures of covariates’ effects (like, e.g. direct, indirect and total impacts, which are discussed in Sect. 3.2).

The double marginalisation performed in Algorithm 1 derives from the following marginalisation of the probability density function of the marked point process:

$$\begin{aligned} f_{Y_P\vert Z_P}(y_P\vert z_P) =\int \left[ \int f_{Y_P,Y_C\vert Z_P,Z_C}(y_P,y_C\vert z_P,z_C)\,\text {d}y_C\right] \varrho (z_C\vert z_P)\,\text {d}z_C , \end{aligned}$$
(17)

where conditioning of all probability density functions with respect to the coarsening vector \(\varPhi \) has been omitted for notational simplicity, and \(\varrho (z_C\vert z_P)=f_{Z_C\vert Z_P}(z_C\vert z_P)\) has been denoted consistently to the notation of Eq. (15). The inner integral of Eq. (17) corresponds to the first marginalisation described at point 3 of Algorithm 1, whereas the outer integral determines the marginalisation of Eq. (15).

The inner integral of Eq. (17) is a marginalisation of a model whose maximum likelihood estimators have been proved to be consistent by Lee (2004), provided that specific requirements on the asymptotic specification of the spatial weight matrix W are satisfied. Yet, the asymptotic behaviour of the double-marginal estimator depends also on both the geocoding propensity function estimator (14) and the intensity function estimator (13); both estimators are consistent, and their asymptotic properties are discussed in Zimmerman (2008), however this is not enough to guarantee that the double-marginal estimator is consistent too. The reason for this is that the spatial weight matrix is built according to both the spatial point pattern and the coarsening process, and its asymptotic behaviour is fully determined by the properties of that two processes. To our knowledge, at the moment, there are no theoretical results which can be exploited in order to prove (or refuse) the consistency of double-marginal estimator.

As for the applicability of the double-marginal estimator, it is worth pointing out that it can be easily adapted or generalised to other coarsening mechanisms, point processes, or stochastic spatial processes, as it is only required that the model can be identified and marginalised. Thus, if a spatial model other than the SAR is considered, Algorithm 1 changes in step 3, where the likelihood function of the model is marginalised with respect to non-coarsened units, whereas the rest of the algorithm does not change.

A special case is represented by the spatial Durbin model (SDM), which generalises the SAR model by including some (or all) spatially lagged covariates amongst the regressors. In this case, both Algorithm 1 and likelihood function (12) are valid without modifications, provided that the design matrix X is properly redefined so as to include extra covariates.

The family of Cliff–Ord spatial models consists of other specifications which somehow include other forms of spatial dependence or allows for other specifications of the covariate effects. An extensive review of the existing Cliff–Ord spatial models can be found in Cressie (2015), Anselin (1988), and LeSage and Pace (2009). Here it is worth reminding the general nesting spatial model (GNSM) defined in Elhorst (2014):

$$\begin{aligned} {\left\{ \begin{array}{ll} y=\rho Wy+\alpha \iota _n+X\beta +WX\theta +u\\ u=\lambda Wu+\varepsilon \\ \varepsilon \sim {{\,\mathrm{\mathcal {N}}\,}}_n(0,\sigma ^2I_n) \end{array}\right. } \end{aligned}$$
(18)

Although the GNSM (18) is not identifyable, it deserves consideration, as it includes the main Cliff–Ord spatial models as special cases, if one or more restrictions are apllied to its parameters.—For example, the SDM is obtained when \(\lambda =0\), whereas the SAR model (1) results if \(\lambda =0\) and \(\delta =0\) (the constant vector \(\iota _n\) can be included into the design matrix X).

The full log-likelihood of the GNSM (18) can be proved to be:

$$\begin{aligned}&{{\,\mathrm{\ln {\mathcal {L}}}\,}}(\rho ,\lambda ,\alpha ,\beta ,\delta ,\sigma ^2) =-\frac{n}{2}\ln (2\pi \sigma ^2)+\ln |A_\rho |+\ln |A_\lambda |\nonumber \\&\quad -\frac{[A_\lambda (A_\rho y-\alpha \iota _n-X\beta -WX\theta )]^{\mathrm{T}}[A_\lambda (A_\rho y-\alpha \iota _n-X\beta -WX\theta )]}{2\sigma ^2} , \end{aligned}$$
(19)

where \(A_\rho =I_n-\rho W\) and \(A_\lambda =I_n-\lambda W\).

The likelihood of models nested in GNSM can be derived from (19), whereas the first analytical marginalisation can be derived through inversions (8) and (9). Once the first marginalisation has been derived, Algorithm 1 can be applied just as illustrated above.

4 Monte Carlo simulations

The performances of the proposed estimation approach in finite samples have been studied by means of Monte Carlo simulations. The complication of both the modelling setting and estimation method considerably widens the variety of scenarios which should be considered for studying the estimators’ properties in finite samples.

Fig. 1
figure 1

Intensity function \(\lambda \) used for generating the point process (left) and the realisation of the process for \(n=250\) with hexagonal partition (\(R=17\)) of the space (right)

In this section twelve different scenarios are considered:

  1. (A)

    a point pattern with \(n=250\) points is generated over an irregular area S according to an inhomogeneous Poisson process with the intensity function \(\lambda \) represented in Fig. 1. The surface S is partitioned into \(R=17\) hexagonal regions of equal size excepting for border zones (see Fig. 1). The SAR includes two regressors (generated as realisations of a standard normal distribution) and a constant term, so that \(X\in {{\,\mathrm{\mathbb {R}}\,}}^{n\times 3}\). The parameters of the SAR are \(\rho =0.5\), \(\beta =[1, 1, -1]^{\mathrm{T}}\), \(\sigma ^2=1\), whereas the spatial weight matrix W is computed according to (3), and \(\kappa (x)=\mathbb {1}_{\{x\le 0.5\}}\) (note that sides of hexagons measure 1.5). Each unit of the point pattern is independently coarsened with probability 0.4, hence the expected number of coarsened units is \({{\,\mathrm{\mathbb {E}}\,}}(p)={{\,\mathrm{\mathbb {E}}\,}}(\varPhi ^{\mathrm{T}}\iota _n)=0.4\cdot n=100\). Simulations are based on \(N=300\) replications, each of which share the same point pattern and design matrix X. Models are fitted through the cross-entropy algorithm for noisy functions (Rubinstein and Kroese 2004) as implemented in Bee et al. (2017), but for the number of draws (denoted by N in Bee et al. 2017) which equals 200 for the first iteration and 100 for subsequent iterations instead of 1000 for all iteration as suggested in Bee et al. (2017). Simulations have been performed by means of the software R (R Core Team 2020), whereas cross-entropy optimisation has been carried out through the R package noisyCE2 (Santi 2020);

  2. (B)

    the same simulation settings as in point (A), except that \(\rho =0.3\);

  3. (C)

    the same simulation settings as in point (A), except that \(\rho =0.7\);

  4. (D)

    the same simulation settings as in point (A), except that \(\sigma ^2=2\);

  5. (E)

    the same simulation settings as in point (A), except that \(n=500\) and \(\kappa (x)=\mathbb {1}_{\{x\le \sqrt{1/8}\}}\). Function \(\kappa \) has been redefined so that the average number of neighbours per unit is the same as in case (A);

  6. (F)

    the same simulation settings as in point (A), except that \(n=1000\) and \(\kappa (x)=\mathbb {1}_{\{x\le 0.25\}}\). Function \(\kappa \) has been redefined so that the average number of neighbours per unit is the same as in case (A);

  7. (G)

    the same simulation settings as in point (A), except that \(\phi (s)\propto 0.8\,\lambda (s)\). Function \(\phi \) is set so that the coarsening probability ranges between 0.2 and 0.75, whereas its average equals 0.4, in line with all the other simulation scenarios;

  8. (H)

    the same simulation settings as in point (A), except that \(\phi (s)\propto -0.8\,\lambda (s)\). Function \(\phi \) is set so that the coarsening probability ranges between 0.04 and 0.60, whereas its average equals 0.4, in line with all the other simulation scenarios;

  9. (I)

    the same simulation settings as in point (A), except that each unit of the point pattern is independently coarsened with probability 0.1 (instead of 0.4), hence the expected number of coarsened units is \({{\,\mathrm{\mathbb {E}}\,}}(p)={{\,\mathrm{\mathbb {E}}\,}}(\varPhi ^{\mathrm{T}}\iota _n)=0.1\cdot n=25\);

  10. (J)

    the same simulation settings as in point (A), except that each unit of the point pattern is independently coarsened with probability 0.2 (instead of 0.4), hence the expected number of coarsened units is \({{\,\mathrm{\mathbb {E}}\,}}(p)={{\,\mathrm{\mathbb {E}}\,}}(\varPhi ^{\mathrm{T}}\iota _n)=0.2\cdot n=50\);

  11. (K)

    the same simulation settings as in point (A), except that each unit of the point pattern is independently coarsened with probability 0.8 (instead of 0.4), hence the expected number of coarsened units is \({{\,\mathrm{\mathbb {E}}\,}}(p)={{\,\mathrm{\mathbb {E}}\,}}(\varPhi ^{\mathrm{T}}\iota _n)=0.6\cdot n=150\);

  12. (L)

    the same simulation settings as in point (A), except that the sides of hexagons measure 1, thus the number of regions is \(R=29\).

For each scenario five estimation methods are considered:

  • the maximum likelihood estimator based on a dataset where locations of all units are known, and there is no coarsening. Hereinafter this estimator is referred to as NCM, which stands for non-coarsened model;

  • the proposed estimator based on double marginalisation (hereinafter DME);

  • the maximum likelihood estimator of the SAR based only on non-coarsened units (hereinafter PDM, which stands for purged data model). In this case the weight matrix is computed using the same \(\kappa \) function as the data generating process, but no standardisation is performed;

  • the maximum likelihood estimator of the SAR based only on non-coarsened units. Unlike the previous case, the spatial weight matrix is row-standardised (hereinafter SPDM, which stands for standardised PDM);

  • the maximum likelihood estimator of the SAR based on all points. Location of coarsened points is imputed to the centroids of regions where points are located, and a row-standardised weight matrix is derived according to the same \(\kappa \) function as the data generating process. Hereinafter, this method is referred to as CIP, which stands for centroid imputed position.

Clearly, the NCM estimates are obtained under the ideal condition of no-uncertainty about unit locations, they are thus expected to be the most efficient amongst the others considered in the simulations.

PDM and SPDM are only based on correctly-georeferenced units, however all the information about dependent and independent variables is lost for coarsened units, which are not involved in the estimation process. This results in smaller effective sample size, in an alteration of part of the elements of the spatial weight matrix W due to row-standardisation, as well as in an alteration of the dependence structure amongst all units induced by the inversion of matrix \(I_n-\rho W\).

On the other hand, the CIP estimates use all the information about dependent and independent variables, whereas the imputation of the unit locations typically alters the actual neighbourhood relations both amongst coarsened units, and between coarsened and non-coarsened units.

DME is compared to all the estimation methods in order to verify whether it produces estimates which are more efficient than those obtained through PDM, SPDM and CIP. NCM estimates are used as the high benchmark.

Results of simulations are summarized in terms of relative root mean squared error (RRMSE) and relative bias in Tables 12 and 3. For reasons of space, impacts estimates about the first regressor only are reported, since estimates on other regressors’ impacts are similar.

Table 1 Relative root mean squared error and relative bias (in parenthesis) of parameter and impact estimators for scenarios A, B, C, D of Monte Carlo simulations (see Sect. 4) for various estimation methods
Table 2 Relative root mean squared error and relative bias (in parenthesis) of parameter and impact estimators for scenarios E, F, G, H of Monte Carlo simulations (see Sect. 4) for various estimation methods
Table 3 Relative root mean squared error and relative bias (in parenthesis) of parameter and impact estimators for scenarios I, J, K, L of Monte Carlo simulations (see Sect. 4) for various estimation methods

As expected, in all scenarios the NCM estimator is the best performer for all parameters both in terms of bias and RMSE, and it is not commented in the following if not for the purpose of making comparisons to other estimation approaches.

Estimates in Tables 1, 2 and 3 show two general results which basically hold under all scenarios.

Firstly, the estimates obtained from all estimation methods are rather stable under all simulation settings for most parameters and impacts. The only remarkable exception is represented by the estimates of the error variance, which are rather sensitive with respect to the value of parameter \(\rho \) and \(\sigma ^2\).

Secondly, the rank of estimation methods in terms of both bias and RMSE is basically the same whatever the scenario we consider, although some differences emerge amongst parameters.

If covariate coefficients are considered (that is \(\beta _0\), \(\beta _1\), \(\beta _2\)), DME estimator is the best performer in terms of relative bias. On the other hand, the CIP estimator exhibits the smallest RRMSE, followed by the SPDM estimator, whereas larger RRMSEs result from DME and PDM estimator. Anyway, both in case of relative bias and RRMSE, differences amongst estimators are rather small, if we consider covariates coefficients \(\beta _1\) and \(\beta _2\), whereas larger variability emerges for \(\beta _0\).

Things change if the autoregressive parameter \(\rho \) is considered. In this case, the DME clearly outperforms all other estimators both in terms of relative bias and RRMSE in all considered scenarios, whereas the second-best estimator is SPDM estimator followed by CIP and PDM estimators. Unlike regressors coefficients, differences amongst estimation methods are large in terms of relative bias and RRMSE.

If the error dispersion parameter \(\sigma \) is considered, the four estimation methods for coarsened data can be gathered into two groups. The former includes the best performers which are DME and SPDM, the latter consists in CIP and PDM estimators, which almost double the relative bias and the RRMSE of estimators in the other group. Interestingly, estimators of each group exhibit very similar relative bias and RRMSE.

The performances of estimators on assessing impacts of covariates clearly reflect the statistical performances on parameters \(\rho \), \(\beta _1\), and \(\beta _2\). Thus CIP, PDM, and SPDM estimators perform well in estimating the direct impact, whereas the DME definitely outperforms the others when indirect impact is estimated. The efficiency of DME on indirect impact estimation is large enough to make DME the most efficient estimator also for the total impact. Analogous results hold also in terms of bias.

Although relative performances of estimators are pretty stable amongst scenarios considered in the simulations, it is worth analysing more in depth the results of the simulations.

The comparison between results of scenarios B, A and C enables to assess the effect of parameter \(\rho \) on the estimators, which turns out to be marked for all parameters and most estimation methods. In particular, as \(\rho \) gets larger (note that \(\rho _{(B)}=0.3\), \(\rho _{(A)}=0.5\), \(\rho _{(C)}=0.7\)), both the relative bias (in absolute value) and the RRMSE of \(\hat{\rho }\) decrease for all estimation methods except for PDM, whereas the absolute value of the relative bias and the RRMSE grow for all the other parameters as \(\rho \) gets larger. The magnitude of relative bias and the RMSE of impact estimators grow as well with \(\rho \) for all estimation methods with the exception of the indirect impact of PDM which is basically independent of \(\rho \).

The effect of \(\sigma ^2\) can be assessed by comparing scenario A (where \(\sigma =1\)) and scenario D (where \(\sigma =2\)). The increase in the variance error leads to greater RRMSE for both the regressor coefficients and the direct impacts, whereas the RRMSE of the autoregressive parameter \(\rho \) and other impacts (indirect and total) remain basically unchanged. On the other hand, both the relative bias and the RRMSE of \(\hat{\sigma }\) improve substantially.

It is worth noting that the relative bias and the RRMSE of the estimators of the autoregressive parameter \(\rho \) and the error dispersion parameter \(\sigma \) are associated. In particular, the larger \((\hat{\rho }-\rho )/\rho \), the smaller the relative bias and the RRMSE of \(\hat{\sigma }\). Such a relation is fully consistent with with the relation in Eq. (2b), where the negative bias of \(\hat{\rho }\) entails a positive bias of \(\hat{\sigma }\).

An analogous relation emerges for the regression coefficients \(\beta _0\), \(\beta _1\), \(\beta _2\), according to Eq. (2a), however in this case the magnitude of the effect is not particularly marked, probably due to the fact that Eq. (2a) is linear in \(\hat{\rho }\), unlike Eq. (2a), which is quadratic.

The effect of the sample size n emerges if the results of scenario A (\(n=250\)), E (\(n=500\)) and F (\(n=1000\)) are compared. As expected, both the relative bias, and the RRMSE of all estimators of non-coarsened model (NCM) gets smaller (in absolute value) as n increases. On the other hand, none of the other estimators show a similar pattern, as neither the relative bias nor the RRMSE exhibits any convergent trend.

If scenarios A (\(\phi (s)=0.4\)), G (\(\phi (s)\propto 0.8\,\lambda (s)\)), and H (\(\phi (s)\propto -0.8\,\lambda (s)\)) are compared, no clear pattern emerges, although it seems that RRMSE tends to slightly increase as we move from scenario G to A, and from A to H, suggesting that better estimates can be obtained if coarsening is more frequent in areas where the intensity of the point process is higher (scenario G), whereas the opposite is true if the intensity of the point process and the coarsening probability are inversely related (scenario H).

The comparison between scenarios A, I, J and K allows one to assess the effect of the proportion of coarsened units on the estimators. The coarsening function \(\phi \) of all scenarios is constant over the domain S, and it is defined so that the expected share of coarsened points equals \(10\%\), \(20\%\), \(40\%\) and \(60\%\) for scenarios I, J, A and K, respectively. The results of the simulations shows that the estimators of all methods (except NCM, obviously) deteriorates in terms of efficiency as the coarsening probability gets larger. If the autoregressive parameter \(\rho \) is considered, a strong bias towards zero emerges as the coarsening probability increases, in line with the empirical and theoretical results provided in Arbia et al. (2016). Clearly, the relative biases of indirect and total impacts derive from the behaviour of the estimators for \(\rho \).

Fig. 2
figure 2

Normal quantile-quantile plot of parameter estimates from scenario A. In each plot the p-value of the Jarque-Bera normality test (Jarque and Bera 1980) has been annotated

It is worth noting that the DME estimators are more efficient and less biased than other estimation methods, however such advantage tends to diminish as the proportion of coarsened points exceeds \(50\%\). Monte Carlo simulations based on \(80\%\) of coarsened units (not presented in this paper for the sake of brevity) have shown that DME and CIP estimators have has similar performances, whereas they are about \(20\%\)\(40\%\) more efficient than other estimation methods.

Finally, if the size of the regions is reduced (scenario L vs. A), the effects of coarsening are more limited, and this turns in to an improved efficiency and relative bias for all the estimators. This is an expected result, since the loss of information due to coarsening is lower if the regions where points are located are smaller.

As for the distribution of double-marginal estimators, the goodness of fit to the Gaussian distribution is generally fairy good. See Fig. 2 for an example based on simulations from scenario A.

5 Application to hedonic models for house prices

In this section, the estimation method based on double marginalisation is applied to an hedonic model for house prices in Beijing.

The dataset used for fitting the model consists of a sample of 361 transactions made freely available by Qichen (2019) and collected through web-scraping from bj.lianjia.com. For each transaction, it is available information on sale price, the number of living and drawing rooms, the number of bathrooms, the type of building, the construction time, the type of building structure, whether the house is close to a subway station, and whether the elevator is available. Moreover, the Beijing’s district and the geographical coordinates where houses are located are available.

Fig. 3
figure 3

Map of the central districts of Beijing (Changping, Chaoyang, Daxing, Dongcheng, Fangshan, Fengtai, Haidian, Mentougou, Shijingshan, Shunyi, Tongzhou, Xicheng) and location of houses considered in the model. Position of correctly located houses is drawn with a cross, whereas (unreliable) position of coarsened locations is drawn by means of a circle

Figure 3 shows the map of points where the houses are located. A group of 65 points (about \(18\%\)) in the map has been marked differently because in those cases longitude and latitude are not consistent with the district where the house should be located. For this reason, those points are considered as coarsened.

Table 4 shows the estimates of regression coefficients and total impacts for a SAR model where the logarithm of house price is regressed over the covariates listed before. The spatial weight matrices has been defined according to the k-nearest neighbour criterion with \(k=20\). The model has been fitted through the double-marginal estimator (DME), the standardised purged data model (SPDM) and the centroid imputed position model (CIP). The NCM model has not been considered, as correct locations of coarsened units is not available, whereas the purged data model (PDM) has not been fitted since in case of equally weighted matrices W based on k-nearest neighbour criterion the SPDM and the PDM models are equivalent.

Results in Table 4 exhibit a good agreement in terms significance and sign of point estimates amongst the three estimation methods, nevertheless differences in magnitudes between several regression coefficients emerge. The estimates of the autoregressive parameter \(\rho \) are rather different from one method to another; as simulations pointed out, as we move from the DME estimator to the CIP estimator and from the latter to the SPDM estimator, the point estimates of \(\rho \) tends to get closer and closer to zero.

Table 4 Autoregressive parameter \(\rho \) and regression coefficients \(\beta _j\) estimates of a SAR model on log-transformed prices of 361 houses in Beijing

6 Conclusions

The estimation method proposed in this paper for tackling the problem of incompletely geocoded data is based on a modelling approach which integrates the point process, the coarsening process and the spatial process through a marked point process model whose likelihood function is then marginalised twice so as to clean out the effects of coarsening.

Monte Carlo simulations for the spatial autoregressive model have shown that the proposed method is basically equivalent to other methods in terms of bias and RMSE in the estimation of regressor coefficients, whereas it returns more efficient and less biased estimates for the spatial autoregressive parameter, the error variance, the indirect impacts, and the total impacts. Gains in efficiency and biasedness are substantial and they clearly emerge under the various simulation settings. The proposed methodology can be generalised in various directions to account for other forms of data incompleteness typically emerging when analysing large spatial datasets related to individual economic agents.