4 Models for areal spatial structure
This chapter is about spatial random effect model specifications for areal data. A simple model based on the adjacency structure between areas is popular in HIV small-area estimation and beyond. The analysis aimed to determine if using a more complex model would result in more accurate predictions.
Modelling spatial correlation is particularly important for small-area estimation of HIV because the covariates most strongly associated with HIV, such as sexual behaviour and STI status (Mayala, Bhatt, and Gething 2020), are difficult to measure. As a result, previous small-area models of HIV have found including covariates only modestly improve predictive performance (Supplementary Figure 20, Dwyer-Lindgren et al. 2019). The lack of predictive covariates emphasises the role of modelling spatial variation. For mapping of other infectious diseases, such as Malaria where transmission is driven by more predictive and easily measurable environmental factors, explanatory covariates are more easily available and directly modelling spatial correlation is less important (Weiss et al. 2015; Bhatt et al. 2015).
Spatial variation in areal data is often modelled using spatial random effects (Haining 2003; Cramb et al. 2018).
The most common class of models used to specify spatial random effects are Gaussian Markov random fields [GMRFs; Havard Rue and Held (2005)].
These models combine a Gaussian distribution with Markov conditional independence assumptions between areas.
Observations in areas close together are assumed to be related, with more distant relationships not directly accounted for.
Perhaps the simplest GMRF model is that of Besag, York, and Mollié (1991) in which information is borrowed equally from each adjacent area, based on a binary relationship.
The Besag model is attractive as it requires minimal additional modelling choices and is accessibly implemented in software such as R-INLA
(Blangiardo et al. 2013), rstan
(Morris et al. 2019; Donegan 2022), NIMBLE
[Chapter 9; de Valpine et al. (2023)] and PyMC
(Saunders 2023), among others.
As a result, it has been widely used, including to model bird population dynamics from capture-recapture data (Saracco et al. 2010); for the analysis of magnetic resonance images (Gössl, Auer, and Fahrmeir 2001; Schmid et al. 2006); to map mortality from cancers (Rashid et al. 2023), injuries (Parks et al. 2020), and air pollution (Bennett et al. 2019); and to model alcohol use patterns (Dwyer-Lindgren et al. 2015).
The Besag model was designed for image analysis, on a regular grid.
However, for more irregular geometries, the assumptions made are unrealistic and appear to be violated.
The administrative divisions of a country used in small-area estimation are one example of a more irregular geometry.
This chapter tests the hypothesis that using more realistic assumptions about spatial structure improves the performance of small-area estimation models.
Performance in this context refers to accurate forecasts of parameters as measured by scoring rules.
In doing so, practical recommendations for modelling areal spatial structure are offered.
Code for the analysis in this chapter is available from https://github.com/athowes/beyond-borders
, and supported by the arealutils
R package (Howes 2023a).
4.1 Models based on adjacency
This section discusses spatial random effect models based on a symmetric adjacency relation \(i \sim j\) between areas \(A_i\) and \(A_j\). Adjacency is typically defined by a shared border, though other choices are possible (Christopher J. Paciorek et al. 2013).
4.1.1 The Besag model
The Besag model (Besag, York, and Mollié 1991) is an improper conditional auto-regressive (ICAR) model where the conditional mean of the random effect \(u_i\) is the average of its neighbours \(\{u_j\}_{j \sim i}\) and the precision is proportional to the number of neighbours. The full conditional distribution of the \(i\)th spatial random effect is given by \[\begin{equation} u_i \, | \, \mathbf{u}_{-i} \sim \mathcal{N} \left(\frac{1}{n_{\delta i}} \sum_{j: j \sim i} u_j, \frac{1}{n_{\delta i}\tau_u}\right), \tag{4.1} \end{equation}\] where \(\delta i\) is the set of neighbours of \(A_i\) with cardinality \(n_{\delta i} = |\delta i|\) and \(\mathbf{u}_{-i}\) is the vector of spatial random effects with the \(i\)th entry removed. By Brook’s lemma (Havard Rue and Held 2005) the set of full conditionals of the Besag model is equivalent to the Gaussian Markov random field (GMRF) given by \[\begin{equation} \mathbf{u} \sim \mathcal{N}(\mathbf{0}, \tau_u^{-1} \mathbf{R}^{-}). \tag{4.2} \end{equation}\] The matrix \(\mathbf{R}^{-}\) is the generalised inverse of the rank-deficient structure matrix \(\mathbf{R}\) with entries \[\begin{equation} R_{ij} = \begin{cases} n_{\delta i}, & i = j \\ -1, & i \sim j \\ 0, & \text{otherwise.} \end{cases} \end{equation}\] The Markov property arises due to the conditional independence structure \(p(u_i \, | \, \mathbf{u}_{-i}) = p(u_i \, | \, \mathbf{u}_{\delta i})\) whereby each area only depends on its neighbours. This is reflected in the sparsity of \(\mathbf{R}\) such that \(u_i \perp u_j \, | \, \mathbf{u}_{-ij}\) if and only if \(R_{ij} = 0\). The structure matrix \(\mathbf{R}\) may also be expressed as the Laplacian matrix of the adjacency graph \(\mathcal{G} = (\mathcal{V}, \mathcal{E})\) with vertices \(v \in \mathcal{V}\) corresponding to each area and edges \(e \in \mathcal{E}\) between vertices \(i\) and \(j\) when \(i \sim j\). Figure 4.1 shows the districts of Zimbabwe with corresponding adjacency graph.
Rewriting Equation (4.2), the probability density function of \(\mathbf{u}\) is \[\begin{equation} p(\mathbf{u}) \propto \tau_u^{\frac{n - n_c}{2}} \times \exp \left( -\frac{\tau_u}{2} \mathbf{u}^\top \mathbf{R} \mathbf{u} \right) \propto \exp \left( -\frac{\tau_u}{2} \sum_{i \sim j} (u_i - u_j)^2 \right). \tag{4.3} \end{equation}\] This density is a function of the pairwise differences \(u_i - u_j\) and so is invariant to the addition of a constant \(c\) to each entry \(p(\mathbf{u}) = p(\mathbf{u} + c\mathbf{1})\). As a result, there is an improper uniform distribution on the average of the \(u_i\). If \(\mathcal{G}\) is connected, in that by traversing the edges, any vertex can be reached from any other vertex, then there is only one impropriety in the model and \(\text{rank}(\mathbf{R}) = n - 1\), while if \(\mathcal{G}\) is disconnected, and composed of \(n_c \geq 2\) connected components with index sets \(I_1, \ldots, I_{n_c}\), then the corresponding structure matrix \(\mathbf{R}\) has rank \(n - n_c\) and the density is invariant to the addition of a constant to each of the connected components \(p(\mathbf{u}_{I}) = p(\mathbf{u}_{I} + c\mathbf{1})\) where \(I = I_1, \ldots, I_{n_c}\).
4.1.2 Best practises for the Besag model
Directly using the Besag model as described in Section 4.1.1 has several practical limitations in applied settings. To overcome these limitations, Freni-Sterrantino, Ventrucci, and Rue (2018) recommend three best practices:
The structure matrix \(\mathbf{R}\) should be rescaled to have generalised variance equal to one. The generalised variance of a matrix is defined by the geometric mean of the diagonal elements of its generalised inverse. For the structure matrix that is \[\begin{equation} \sigma^2_{\text{GV}}(\mathbf{R}) = \prod_{i = 1}^n (\mathbf{R}^-_{ii})^{1/n} = \exp \left( \frac{1}{n} \sum_{i = 1}^n \log (R^-_{ii}) \right). \end{equation}\] The scaled structure matrix \(\mathbf{R}^\star\) is given by \[\begin{equation} \mathbf{R}^\star = \mathbf{R} / \sigma^2_{\text{GV}}(\mathbf{R}). \end{equation}\] As the diagonal elements \(R^-_{ii}\) correspond to marginal variances, the generalised variance gives a measure of the average marginal variance. This measure, introduced by Sørbye and Rue (2014), ignores off-diagonal entries. More broadly, other measures of typical variance could be used.
Scaling mitigates the influence of the adjacency graph on the variance of \(\mathbf{u}\). For consistent and interpretable prior distribution selection, it is important to allow the variance to be controlled by \(\tau_u\) alone.
When the adjacency graph is disconnected it is not appropriate to scale the structure matrix \(\mathbf{R}\) uniformly. This is because, given the precision \(\tau_u\), local smoothing operates on each connected component independently. As such, each connected component \(I = I_1, \ldots, I_{n_c}\) should be scaled independently to have generalised variance one \[\begin{equation} \mathbf{R}^\star_I = \mathbf{R}_I / \sigma^2_{\text{GV}}(\mathbf{R}_I) \end{equation}\] where \(\mathbf{R}_I\) is the sub-matrix of the structure matrix corresponding to index set \(I\).
When one of the connected components is a single area, known either as a singleton or an island, the probability density \[\begin{equation} p(u_i) \propto \exp \left( -\frac{\tau_u}{2} \sum_{i \sim j} (u_i - u_j)^2 \right) \propto 1 \end{equation}\] has no dependence on \(u_i\). This is equivalent to using an improper prior. To avoid this, each singleton should be set to have independent Gaussian noise \(p(u_i) \sim \mathcal{N}(0, \tau_u^{-1})\).
To avoid confounding of the spatial random effects with the intercept, it is recommended to place a sum-to-zero constraint on each non-singleton connected component. In other words, \[\begin{equation} \sum_{i \in I} u_i = 0, \quad |I| > 1. \end{equation}\] As such, in total the number of sum-to-zero constraints equals to the number of non-singleton connected components.
4.1.3 Concerns about the Besag model
The Besag model was originally proposed by Besag, York, and Mollié (1991) for use in image analysis. In this setting, areas correspond to pixels arranged in a regular lattice structure. In an image, the data point at each pixel can be thought of as an average of the intensity or colour over the space represented by the pixel.
Since its original proposal, the Besag model has seen wider use. However, for small-area estimation of HIV, the spatial structure corresponds to administrative units. These administrative units may have a more irregular spatial structure than a lattice. Furthermore, data points may not come about by uniform averaging over a space. For example, population density may vary across the area.
These considerations raise concerns about the Besag model’s applicability to the small-area estimation setting, which we explore in this section. The discussion is closely linked to the modifiable areal unit problem (Openshaw and Taylor 1979), whereby statistical conclusions change as a result of seemingly arbitrary changes in data aggregation, and the challenge of ecological inference and the ecological fallacy (Jonathan Wakefield and Lyons 2010).
4.1.3.1 Compression to adjacency
A fundamental objection is that summarising a geometry by an adjacency graph represents a loss of information. Many geometries share the same adjacency graph, and as such are isomorphic under the Besag model (Figure 4.2). Though not in itself a problem, this fact prompts consideration whether the class of geometries with the same adjacency graph is sufficiently similar to merit identical models.
Intuitively, the more regular the spatial structure, the less information is lost in compression to an adjacency graph. In image analysis, very little spatial information is lost in compression of a lattice structure to an adjacency graph. On the other hand, the regions of a country, determined by political and geographic forces, tend to display greater irregularity. The appropriateness of adjacency compression varies therefore by the type of geometry common to the application setting.
The regularity of realistic geometries may help to constrain each class to be similar. In other words, although pathological geometries can be constructed, they might be implausible in statistical practice and so of limited concern.
4.1.3.2 Mean structure
In the Besag model all adjacent areas count equally in the equation for the conditional mean. This assumption is unsatisfying, as for most geometries we expect different amounts of correlation between neighbouring areas. Figure 4.2 illustrates a number of heuristic features for neighbour importance. In Panel 4.2C, the area with a longer shared border would be expected to be more highly correlated. In Panel 4.2D, the area with a closer centre would be expected to be more highly correlated.
4.1.3.3 Variance structure
In Equation (4.1) the precision of \(u_i\) is proportional to its number of neighbours \(n_{\delta i}\). It follows that as \(n_{\delta i} \to \infty\) then \(\text{Var}(u_i) \to 0\). This is illustrated by Figure 4.3 where the area on the right is repeatedly divided such that its number of neighbours increases. This property is a consequence of averaging the conditional mean over a greater number of areas, which, in certain situations, can correspond to a greater amount of information. However, if the amount of information in the shaded area remains fixed, it is inappropriate that \(\text{Var}(u_1)\) should tend to zero as a result of drawing additional, arbitrary, boundaries. In the image analysis setting this modelling assumption is reasonable: each pixel represents a fixed amount of information and a higher pixel density represents a greater amount of information. On the other hand, in public health and epidemiology, drawing boundaries to create additional areas is not expected to correspond to a greater amount of information.
As a second example of undesirable behaviour, suppose we fit a Besag model upon identical data using each of the two geometries in Figure 4.4. If the spatial variation is relatively smooth, dividing the shaded areas into two will result in a lower estimated variance \(\sigma^2_u\) in Panel 4.4B as compared with Panel 4.4A because there will appear to be less variation between neighbouring areas. This problem does not only apply locally: since the effect of \(\sigma^2_u\) applies everywhere, the smoothing will change even in unaltered parts of the study region.
4.1.4 Weighted ICAR models
The Besag model is a special case of a more general class of (zero-mean) weighted ICAR models. These models can be specified in terms of scaled weights \(\{b_{ij}\}_{j \sim i}\) and a precision vector \(\boldsymbol{\mathbf{\kappa}} = (\kappa_i)_{i \in [n]}\). The full conditionals are then \[\begin{equation} u_i \, | \, \mathbf{u}_{-i} \sim \mathcal{N} \left( \sum_{j: j \sim i} b_{ij} u_j, \frac{1}{\kappa_i \tau_u} \right). \tag{4.4} \end{equation}\] Setting \(b_{ij} = 1 / n_{\delta i}\) and \(\kappa_i = n_{\delta i}\) recovers the Besag model in Equation (4.1). The structure matrix \(\mathbf{R}\) corresponding to the more general full conditionals in Equation (4.4) is \[\begin{equation} \mathbf{R} = \mathbf{D}_\kappa(\mathbf{I} - \mathbf{B}), \end{equation}\] where the unscaled weights matrix \(\mathbf{B}\) has elements \[\begin{equation} \mathbf{B}_{ij} = \begin{cases} b_{ij}, & \text{for } i \sim j, \\ 0, & \text{for } i = j, i \nsim j. \end{cases}, \end{equation}\] and the matrix \(\mathbf{D}_\kappa\) is given by \(\text{diag}(\kappa_1, \ldots, \kappa_n)\). Ensuring that the structure matrix is symmetric requires that for all \(i, j \in [n]\) \[\begin{equation} - b_{ij} \kappa_i = - b_{ji} \kappa_j. \end{equation}\] To meet this condition, it can be simpler to directly consider symmetry of the unscaled weights matrix \[\begin{equation} \mathbf{W} = \mathbf{D}_\kappa \mathbf{B}, \end{equation}\] such that \(\mathbf{R} = \mathbf{D}_\kappa - \mathbf{W}\). For the Besag model the unscaled weights matrix \(\mathbf{W}\) corresponds to the adjacency matrix. Scaled weights can be recovered by \(b_{ij} = w_{ij} / \kappa_i\) where \(\kappa_i = \sum_{k: k \sim i} w_{ik}\). Duncan, White, and Mengersen (2017) provide discussion of methods for specifying \(\mathbf{W}\), including \[\begin{align} w_{ij} &= \left( \frac{1}{d_{ij}} \right), \\ w_{ij} &= \exp (-d_{ij}). \end{align}\] Weighted ICAR models appear to overcome some of the limitations discussed in Section 4.1.3.
4.1.5 The reparameterised Besag-York-Mollié model
Often, as well as spatial correlation, there exists IID over-dispersion in the residuals and it is inappropriate to use purely spatially structured random effects in the model. The Besag-York-Mollié (BYM) model of Besag, York, and Mollié (1991) accounts for this in a natural way by decomposing the spatial random effect \(\mathbf{u} = \mathbf{v} + \mathbf{w}\) into a sum of an unstructured IID component \(\mathbf{v}\) and a spatially structured Besag component \(\mathbf{w}\). Each component has its own respective precision parameter \(\tau_v\) and \(\tau_w\). The resulting distribution is \[\begin{equation} \mathbf{u} \sim \mathcal{N}(0, \tau_v^{-1} \mathbf{I} + \tau_w^{-1} \mathbf{R}^{-}) \tag{4.5}. \end{equation}\] Including both \(\mathbf{v}\) and \(\mathbf{w}\) is intended to enable the model to learn the relative extent of the unstructured and structured components via \(\tau_v\) and \(\tau_w\).
However, in the BYM model, scaling of the Besag precision matrix \(\mathbf{Q}\) is not taken into account despite this issue being particularly pertinent when dealing with multiple sources of noise. In particular, placing a joint prior distribution \[\begin{equation} (\tau_v, \tau_w) \sim p(\tau_v, \tau_w) \end{equation}\] which does not privilege either component is more easily accomplished if \(\mathbf{Q}\) and \(\mathbf{I}\) have the same scale. Additionally, supposing one has a prior belief that the over-dispersion is primarily IID and \(\mathbf{v}\) accounts for the majority of the dispersion, then it is not immediately obvious how to represent this belief, without inadvertently altering the prior distribution on the amount of overall variation. This highlights identifiability issues of the parameters \((\tau_v, \tau_w)\) resulting from their non-orthogonality.
Building on the models of Leroux, Lei, and Breslow (2000) and Dean, Ugarte, and Militino (2001) which tackle this identifiability problem, but do not scale the spatially structured noise, Simpson et al. (2017) propose a reparameterisation \((\tau_v, \tau_w) \mapsto (\tau_u, \phi)\) of the BYM model. This is known as the BYM2 model and given by \[\begin{align} \mathbf{u} = \frac{1}{\tau_u} \left( \sqrt{1- \phi} \, \mathbf{v} + \sqrt{\phi} \, \mathbf{w}^\star \right), \tag{4.6} \end{align}\] where \(\tau_u\) is the marginal precision of \(\mathbf{u}\), \(\phi \in [0, 1]\) gives the proportion of the marginal variance explained by each component, and \(\mathbf{w}^\star\) is a scaled version of \(\mathbf{w}\) with precision matrix given by the scaled structure matrix \(\mathbf{R}^\star\). When \(\phi = 0\) the random effects are IID, and when \(\phi = 1\) the random effects follow the Besag model. To borrow an analogy (Håvard Rue 2020) the parameterisation \((\tau_v, \tau_w)\) is like having one hot water and one cold water tap, whereas the parameterisation \((\tau_u, \phi)\) is like a mixer tap where the amount of water and its temperature can be adjusted separately.
Although the BYM and BYM2 models were originally proposed using the Besag model as the spatially structured component, this need not be the case. Indeed, more broadly it is reasonable to consider convolved random effects (of a form analogous to that in Equation (4.5) or (4.6)) with any model for spatially structured noise. Any limitations of the model for spatially structured random effects are inherited by the convolved random effects.
4.2 Models using kernels
Section 4.1 reviewed ways to construct spatial random effect precision matrices using an adjacency relation. An alternate approach is to define the covariance matrix using an areal kernel function which gives a measure of similarity between two areas. Such a function may be specified as \[\begin{equation} K: \mathcal{P}(\mathcal{S}) \times \mathcal{P}(\mathcal{S}) \to \mathbb{R}, \tag{4.7} \end{equation}\] where \(\mathcal{P}\) denotes the power set such that \(\mathcal{P}(\mathcal{S})\) is the space of subsets of the study region. If the function \(K\) is positive semi-definite, then define areal kernel spatial random effects by \[\begin{equation} \mathbf{u} \sim \mathcal{N} \left( 0, \frac{1}{\tau_u} \mathbf{K} \right), \tag{4.8} \end{equation}\] where the \(n \times n\) Gram matrix \(\mathbf{K}\) with entries \(K_{ij} = K(A_i, A_j)\) is a valid covariance matrix. The precision parameter \(\tau_u\) is placed outside of the Gram matrix, analogous to the relation of the precision and structure matrices, but could be omitted. Areal kernels may be thought of as a type of kernels on sets (Gärtner et al. 2002).
It is challenging to think directly about the correlation structure between areas. Instead, most well-known spatial process models define the correlation structure between points using a kernel function \[\begin{equation} k: \mathcal{S} \times \mathcal{S} \to \mathbb{R}. \tag{4.9} \end{equation}\] A simple method, and the one considered here henceforth, is to construct \(K\) (Equation (4.7)) from \(k\) (Equation (4.9)) by averaging the kernel \(k\) computed on some number of points representing each area. In Section 4.2.1 one point is used, and in Section 4.2.2 multiple points are used.
4.2.1 Centroid kernel
The simplest approach is to use a single point to represent each area such that \[\begin{equation} K(A_i, A_j) = k(p_i, p_j). \end{equation}\] A natural choice is the centroid \(p_i = c_i\), given by the arithmetic mean of the latitude and longitude. (Note that it is not guaranteed for the centroid to lie within the area i.e. it is possible \(c_i \notin A_i\), and more generally points representing an area may not be contained by that area.) This choice results in the centroid kernel \[\begin{equation} K(A_i, A_j) = k(c_i, c_j). \tag{4.10} \end{equation}\] The centroid kernel has been used in environmental epidemiology (J. Wakefield and Morris 1999), for US election modelling (Flaxman, Wang, and Smola 2015), and to model the reproduction number of COVID-19 (Teh et al. 2022). In a model comparison study Nicky Best, Richardson, and Thomson (2005) (Section 3) simulated data representing heterogeneous exposure to air pollution, including elevated rates of exposure near two hypothetical point source locations, and found that the centroid kernel tended to over-smooth the high-risk areas. That said, it is unsurprising that a stationary covariance function would struggle to recover non-stationary structure.
4.2.2 Integrated kernel
Rather than choosing a single representative point, an alternative is to more completely represent the area by integrating the kernel over the areas of interest (Kelsall and Wakefield 2002; Follestad and Rue 2003). This results in the integrated kernel \[\begin{equation} K(A_i, A_j) = \frac{1}{|A_i||A_j|} \int_{A_i} \int_{A_j} k(s, s') \text{d} s \text{d} s'. \tag{4.11} \end{equation}\] Unlike for the centroid kernel where \(K_{ii} = 1\) for all \(i\), the marginal variance of the \(i\)th spatial random effect \(K_{ii} = K(A_i, A_i)\) varies depending on the area: becoming smaller for more compact areas and larger for areas which are of greater extent or more spread out.
This covariance structure is equivalent to that obtained by aggregating a spatially continuous Gaussian process with kernel \(k\) over the areal partition. In the machine learning literature, models of this kind have been studied under the name aggregated Gaussian processes (Law et al. 2018; Tanaka et al. 2019; Yousefi, Smith, and Alvarez 2019; Hamelijnck et al. 2019; Chau, Bouabid, and Sejdinovic 2021). Examples of use of this model in statistical practice are rare.
4.2.2.1 Accounting for heterogeneity
Additional information accounting for heterogeneity over \(A_i\) may be incorporated into the integrated kernel. This can be accomplished using weighting distributions \(\{w_i\}\) which represent an unequal contribution of each point to the similarity measure. The weighted integrated kernel is given by \[\begin{equation} K(A_i, A_j) = \frac{1}{|A_i||A_j|}\int_{A_i} \int_{A_j} w_i(s) w_j(s') k(s, s') \text{d} s \text{d} s'. \tag{4.12} \end{equation}\] This areal kernel may be useful in disease mapping. For example, areas with populations who live close to a shared border are likely to be more strongly correlated than areas whose populations live far apart. This detail could be accounted for by weighting according to a high resolution measure of population density. Though e.g. weighted centroids may also be used in Equation (4.10), accounting for heterogeneity over an area is more natural within the integrated kernel than the centroid kernel.
4.2.2.2 Computation
Most of the time it is not possible to calculate Equation (4.12) analytically. Instead, consider \(n\) collections of \(L_i\) samples \(\{s^{(i)}_l \}_{l = 1}^{L_i} \sim \mathcal{U}(A_i)\) drawn uniformly from each area. Then the integral may be approximated using Monte Carlo by the double sum \[\begin{equation} K(A_i, A_j) \approx \frac{1}{L_i L_j} \sum_{l = 1}^{L_i} \sum_{m = 1}^{L_j} w_i \left( s^{(i)}_l \right) w_j \left( s^{(j)}_m \right) k \left( s^{(i)}_l, s^{(j)}_m \right). \tag{4.13} \end{equation}\] Equivalently, samples drawn from \(W_i\) may be used without weighting by \(w_i(s)\). Nodes may also be selected deterministically to give a numerical quadrature estimate of the kernel. Figure 4.5 shows three possible ways of choosing points \(s^{(i)}_l\), together with the centroids approach.
Computing the \(n \times n\) Gram matrix \(K\) requires \[\begin{equation} \mathcal{O}(\sum_{i = 1}^n \sum_{j = 1}^n L_i L_j) \end{equation}\] evaluations of the kernel \(k\). This imposes a significant computational cost if the Gram matrix is often recomputed during inference. For example, during MCMC when the kernel has hyperparameters which are learnt then the Gram matrix is recomputed for each proposed set of hyperparameters. As such, there is a limit on the size of \(L_i\) which it is feasible to use. Kelsall and Wakefield (2002) encounter this challenge, and take the approach of using a discrete hyperparameter prior to reduce the number of Gram matrix constructions and inversions required.
4.2.2.3 Connection to log-Gaussian Cox processes
The log-Gaussian Cox Process framework (Diggle et al. 2013) arrives naturally at the integrated kernel formulation (Li et al. 2012). A Cox process is an inhomogeneous Poisson process with a continuous stochastic intensity function \(\{ x(s), s \in \mathcal{S} \}\) such that conditional on the realisation of \(x(s)\) the number of points in any area \(A_i\) follows a Poisson distribution. The rate parameter of this Poisson distribution is explicitly aggregated as follows \[\begin{equation} y_i \, | \, x(s) \sim \text{Poisson} \left(\int_{s \in A_i} x(s) \text{d}s \right). \end{equation}\] In a LGCP the log intensity \(\log x(s) = \eta(s)\) is modelled using a Gaussian process prior \(\eta(s) \sim \mathcal{GP}(\mu(s), k(s, s'))\). O. Johnson, Diggle, and Giorgi (2019) obtain Equation (4.12) by considering a discrete Poisson log-linear mixed model approximation to a continuous LGCP, whereby \(\eta(s)\) is approximated by a piecewise constant \(\eta_i = \mu_i + u_i\) in each area \(A_i\). The \(i\)th discrete spatial random effect is then \(u_i = \int_{A_i} w_i(s) u(s) \text{d}s\), with covariance structure \[\begin{equation} \text{Cov} \left( \int_{A_i} w_i(s) u(s) \text{d}s, \int_{A_j} w_j(s') u(s') \text{d}s' \right) = \int_{A_i} \int_{A_j} w_i(s) w_j(s') k(s, s') \text{d}s\text{d}s', \end{equation}\] corresponding to an areal integrated kernel with a logarithmic link function and Poisson likelihood.
4.2.2.4 Connection to disaggregation regression
Disaggregation regression, also known as downscaling or interpolation, is another closely related approach. Rather than focusing on the aggregate nature of areal observations as a route towards better area-level estimates, disaggregation regression aims to produce high-resolution or point-level estimates from areal observations (Utazi et al. 2019; Arambepola et al. 2022; Nandi et al. 2023). These two tasks are similar, and indeed it could be argued that accurate point-level estimates are a necessary intermediate step towards accurate area-level estimates. However, disaggregation regression is challenging without auxiliary covariate information, and therefore unlikely to be applicable to small-area estimation of HIV.
4.3 Simulation study
This simulation study tests the ability of inferential models with varying spatial random effect specifications to accurately recover small-area quantities. The data and modelling choices were designed with a spatial epidemiology application in mind.
4.3.1 Synthetic data
Model | Details |
---|---|
IID | \(\mathbf{u} \sim \mathcal{N}(0, \mathbf{I}_n)\) |
Besag | \(\mathbf{u} \sim \mathcal{N}(0, {\mathbf{R}^\star}^{-})\) as in Section 4.1.1 |
Integrated kernel (IK) | \(\mathbf{u} \sim \mathcal{N}(0, \mathbf{K}^\star)\) as in Section 4.2.2 with Matérn kernel, \(\nu = 3/2, l = 2.5\) and \(L_i = 100\) points per area |
Data \(\mathbf{y} = (y_i)_{i \in [n]}\) were simulated from a binomial likelihood \(y_i \sim \text{Bin}(m_i, \rho_i)\). The probabilities \(\rho_i \in [0, 1]\) were linked to linear predictors \(\eta_i \in \mathbb{R}\) via \[\begin{equation} \log \left( \frac{\rho_i}{1 - \rho_i} \right) = \eta_i = \beta_0 + u_i, \quad i \in [n]. \end{equation}\] Spatial random effects were generated according to three different models (Table 4.1). Sample sizes were fixed as \(m_i = 25\) for all \(i \in [n]\), the intercept parameter as \(\beta_0 = -2\) and the spatial random effect precision parameter as \(\tau_u = 1\).
Seven geometries were considered (Figure 4.6). These included the four vignette geometries from Figure 4.2 which share an adjacency graph. Three more realistic geometries were included to represent plausible variation over spatial regularity for the small-area estimation setting. From the most to the least spatially regular, these geometries were: a \(6 \times 6\) lattice grid; the 33 districts of Côte d’Ivoire; and the 36 congressional districts of Texas. For each of the three spatial random effect models and seven geometries 250 synthetic data were generated, resulting in a total of 5250 synthetic data.
4.3.2 Inferential models
Model | Details |
---|---|
IID | \(\mathbf{u} \sim \mathcal{N}(0, \tau_u^{-1} \mathbf{I}_n)\) |
Besag | \(\mathbf{u} \sim \mathcal{N}(0, \tau_u^{-1} {\mathbf{R}^\star}^{-})\) as in Section 4.1.1 |
BYM2 | \(\mathbf{u} = \tau_u^{-1} ( \sqrt{1 - \pi} \, \mathbf{v} + \sqrt{\pi} \, \mathbf{w}^\star)\) as in Section 4.1.5 with \(\pi \sim \text{Beta}(0.5, 0.5)\) |
FCK | \(\mathbf{u} \sim \mathcal{N}(0, \tau_u^{-1} \mathbf{K})\) with \(K_{ij} = k(c_i, c_j)\) as in Section 4.2.1 with fixed length-scale \(l\) |
CK | \(\mathbf{u} \sim \mathcal{N}(0, \tau_u^{-1} \mathbf{K})\) with \(K_{ij} = k(c_i, c_j)\) as in Section 4.2.1 with length-scale prior distribution \(l \sim \mathrm{Inv{\text-}Gamma}(a, b)\) with \(a, b\) set based on the geometry |
FIK | \(\mathbf{u} \sim \mathcal{N}(0, \tau_u^{-1} \mathbf{K})\) with \(K_{ij} = K(A_i, A_j)\) as in Section 4.2.2 with hexagonal points (Panel 4.5C), \(L_i = 10\), and fixed length-scale \(l\) |
IK | \(\mathbf{u} \sim \mathcal{N}(0, \tau_u^{-1} \mathbf{K})\) with \(K_{ij} = K(A_i, A_j)\) as in Section 4.2.2 with hexagonal points (Panel 4.5C), \(L_i = 10\), and length-scale prior distribution \(l \sim \mathrm{Inv{\text-}Gamma}(a, b)\) with \(a, b\) set based on the geometry |
Eight inferential models were fit to the synthetic data (Table 4.2). Apart from the spatial random effect specification, each inferential model corresponded exactly to the simulation model.
4.3.2.1 Kernels
Gram matrices were computed using the Matérn kernel \(k: \mathcal{S} \times \mathcal{S} \to \mathbb{R}\) (Stein 1999) given by \[\begin{equation} k(s, s') = \frac{1}{2^{\nu - 1}\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\lvert s - s' \rvert}{l}\right)^\nu B_\nu\left(\frac{\sqrt{2\nu}\lvert s - s' \rvert}{l}\right). \tag{4.14} \end{equation}\] In Equation (4.14):
- \(B_\nu\) is the modified Bessel function of the second kind;
- \(|s - s'|\) is the Euclidean distance between the point locations \(s\) and \(s'\);
- \(\nu\) is the smoothness hyperparameter;
- \(l\) is the length-scale hyperparameter on the latitude-longitude scale.
We fixed the smoothness hyperparameter \(\nu\) to \(3/2\) to avoid concerns regarding the joint identifiability of the smoothness and lengthscale hyperparameters. This value matches that used to simulate data, and simplifies Equation (4.14) as follows \[\begin{equation} k(s, s') = \left(1 + \frac{\sqrt{3} \lvert s - s' \rvert}{l} \right) \exp \left(- \frac{\sqrt{3} \lvert s - s' \rvert}{l} \right). \end{equation}\]
The number of points per area \(L_i\) was set to 10 with a hexagonal spacing structure (Panel 4.5C).
The actual values of \(L_i\) sometimes differed from 10 because sf::st_sample
with type = "hexagonal"
does not guarantee exactly the specified number of samples are returned (E. Pebesma 2018).
4.3.2.2 Prior distributions
A weakly informative half-Gaussian prior was placed on the standard deviation such that \(\sigma_u \sim \mathcal{N}_+(0, 2.5^2)\) (Gelman 2006). The value 2.5 avoids placing significant prior density on the region \(\sigma_u > 5\), which after logistic transformation would facilitate undesirable variation on the probability scale very close to zero or one. A weakly informative \(\mathcal{N}(-2, 1)\) prior was placed on \(\beta_0\), setting most of the prior probability density for \(\text{logit}^{-1}(\beta_0)\) within a range typical for a disease prevalence.
In cases where the length-scale \(l\) was fixed, it was set based on the geometry such that points an average distance apart had 1% correlation (N. Best et al. 1999). In cases where a prior distribution was set on the length-scale it was \(l \sim \mathrm{Inv{\text-}Gamma}(a, b)\), with \(a\) and \(b\) chosen for each geometry such that 5% of the prior mass was below the 5% quantile for distance between points and 5% of the prior mass was above the 95% quantile (Betancourt 2017). The sensitivity analysis in Appendix A.2 illustrates the extent to which six possible lengthscale prior distributions (Figure A.9) affect the lengthscale posterior distribution (Figure A.10).
4.3.2.3 Inference
Approximate Bayesian inference was conducted using adaptive Gauss-Hermite quadrature [AGHQ; Stringer, Brown, and Stafford (2022)] with \(k = 3\) quadrature points over a marginal Laplace approximation via the aghq
package (Stringer 2021).
Models were implemented using a Template Model Builder C++ template for the log-posterior via the TMB
package (Kristensen et al. 2016).
Appendix A.1 compares posterior mean and standard deviations from AGHQ to those obtained using the No-U-Turn Sampler (NUTS) Hamiltonian Monte Carlo (HMC) algorithm run using Stan
(Carpenter et al. 2017) via the tmbstan
package (Monnahan and Kristensen 2018).
4.3.3 Model assessment
Let the parameter \(\phi\) have posterior marginal \(f(\phi) = p(\phi \, | \, \mathbf{y})\) with cumulative distribution function \(F\). Let \(\phi_s\) be samples \(s \in [S]\) from \(f\). Here, the number of samples per posterior marginal was \(S = 200\). Let \(\omega\) be the true value of \(\phi\) used in the simulation.
The accuracy of latent field parameter and hyperparameter posterior marginals from each model were assessed using three methods. These were the mean squared error (MSE), the continuous ranked probability score [CRPS; Matheson and Winkler (1976)], and the probability integral transform (PIT; Dawid (1984)) values.
The MSE is a simple and popular measure, calculated using samples as \[\begin{equation} \text{MSE}(f, \omega) \approx \frac{1}{S} \sum_{s = 1}^S (\phi_s - \omega)^2. \end{equation}\] The CRPS is a strictly proper scoring rule (SPSR) which has favourable properties and is often regarded as a default choice (Gneiting and Raftery 2007). Any scoring rule which is not strictly proper rewards a misrepresentation of beliefs. The CRPS is \[\begin{equation} \text{CRPS}(f, \omega) = \int_{-\infty}^{\infty} (F(\phi) - \mathbb{I} \{\phi \geq \omega \} )^2 \text{d}\phi. \tag{4.15} \end{equation}\] The CRPS may be estimated using samples by \[\begin{equation} \text{CRPS}(f, \omega) \approx \frac{1}{S} \sum_{s = 1}^S | \phi_s - \omega | - \frac{1}{2S^2} \sum_{s = 1}^S \sum_{l = 1}^S | \phi_s - \phi_l |. \tag{4.16} \end{equation}\]
A posterior marginal is calibrated if over repeated simulations the quantile of the true value, known as the PIT value, is uniformly distributed such that \[\begin{equation} F(\omega) \approx \frac{1}{S} \sum_{s = 1}^S \mathbb{I} \{\phi_i \geq \omega \} = q \sim \mathcal{U}[0, 1]. \tag{4.17} \end{equation}\] If Equation (4.17) holds then at any given nominal coverage \(1 - \alpha\) the proportion of quantile-based credible intervals containing \(\omega\) is also \(1 - \alpha\). Uniformity was assessed using PIT histograms (Dawid 1984) and empirical cumulative distribution function (ECDF) difference plots (Aldor-Noiman et al. 2013) with simultaneous confidence bands as described in Säilynoja, Bürkner, and Vehtari (2022).
4.3.4 Results
4.3.4.1 Vignette geometries
As each geometry only had three areas, the sample size of 250 synthetic data was insufficient to distinguish between inferential models for the vignette geometries. Figures A.13, A.14, A.15 and A.16 show that almost all 95% credible intervals for the mean CRPS in estimating \(\rho_i\) overlap.
Additionally, for the vignette geometries, both the heuristic method for fixing a lengthscale, and lengthscale prior distribution, were misspecified. Three points was insufficient to learn the lengthscale, and as such misspecification of the prior distribution propagated to the posterior distribution (Figure A.11).
To produce higher resolution and more meaningful results, the simulation study for the vignette geometries should be rerun. Two changes should be made. First, an increase to the sample size. Second, more careful specification of study with regard to the lengthscale.
4.3.4.2 Realistic geometries
The two problems with the vignette geometry study did not apply to the more realistic geometries. Figures 4.7, 4.8 and 4.9 show mean CRPS values in estimating \(\rho_i\) with 95% credible intervals which rarely overlap, and hence provide meaningful findings. Mean MSE and CRPS values are provided in Tables A.2 and A.3. The mean values are an average over both the number of areas in each geometry and the number of simulations run.
The mean CRPS varied substantially between the three models (Table 4.1) used to simulate synthetic data. IID structure is harder to predict than spatial structure, and to a lesser extent, Besag structure is harder to predict than IK. This observation is explained by correlation structure making forecasting easier.
For IID synthetic data, the IID and BYM2 models performed well. The BYM2 model also performed almost as well as the Besag model on the spatially structured synthetic data. Appendix A.3.2 shows that the BYM2 proportion parameter successfully recovers either IID or spatial structure. Meanwhile, the IID model performed poorly on spatially structured synthetic data.
The performance of kernel models on IID and Besag synthetic data diminished with increasingly spatially irregular geometry. For the most part, differences between the centroid and integrated kernel models were small, even for synthetic data generated from the IK model. Only for the IK simulated data there was a significant difference between the kernel models with a fixed lengthscale and prior distribution set on the lengthscale.
Interpretation of CRPS choropleths (Figures 4.7, 4.8 and 4.9) was challenging primarily due to two factors: varying scores by simulation model, and limited sample size at the area-level. It would be relatively simple to remedy these challenges, such that figures of this kind could help to uncover precise findings about spatial random effect models.
For IID synthetic data, spatial models tend to produce “U”-shaped ECDF difference plots (Figures A.28, A.29 and A.30). In other words, the quantile of the true value is too often near zero or one. This pattern corresponds to over-smoothing.
4.4 HIV prevalence study
Simulation studies are a valuable tool for experimenting on models in controlled environments. However, it is difficult to capture the complexity of a realistic applied scenario using simulation. Therefore, it is important to complement simulation studies with studies conducted on real data. To this end, model performance was compared in estimating district-level HIV prevalence \(\rho_i \in [0, 1]\) in adults aged 15-49. Household survey data was used from across four countries in sub-Saharan Africa (Table 4.3, Figure 4.10).
Country | Survey | Number of areas | Analysis level |
---|---|---|---|
Côte d’Ivoire | PHIA 2017 | 33 | Regions |
Malawi | PHIA 2016 | 31 | Health districts and cities, with islands removed |
Tanzania | PHIA 2017 | 26 | Regions, with islands removed |
Zimbabwe | PHIA 2016 | 60 | Districts |
4.4.1 Household survey data
Data from the most recent publicly available Population Health Impact Assessment (PHIA) survey were used in each country. Let \(y_{ij} \in \{0, 1\}\) be the survey response for individual \(j\) in area \(i\). The survey designs used were complex in that each individual had potentially unequal probabilities \(\pi_{ij}\) of being included in the survey. Sampling weights \[\begin{equation} w_{ij} = \frac{1}{\pi_{ij}} \end{equation}\] were used to account for the complex survey design. The survey weighted prevalence in area \(i\) is \[\begin{equation} \rho_i^\star = \frac{\sum_{j} w_{ij} y_{ij}}{\sum_{j} w_{ij}}. \tag{4.18} \end{equation}\] The effective number of cases \(y_i^\star = \rho_i^\star \cdot m_i^\star\) is given by the product of the weighted prevalence, and the Kish effective sample size (Kish 1965) \[\begin{equation} m_i^\star = \frac{(\sum_j w_{ij})^2}{\sum_j w_{ij}^2}, \end{equation}\] and may be intuitively thought of as what would have been observed had the survey been a simple random sample.
4.4.2 Inferential models
The inferential models used correspond to those in Section 4.3 with a small modification. As before, prevalences \(\rho_i\) were modelled via \(\text{logit}(\rho_i) = \beta_0 + u_i\) with spatial random effect specification varied according to Table 4.2. Due to survey weighting, the effective number of cases \(y_i^\star \in \mathbb{R}\) and effective sample size \(m_i^\star \in \mathbb{R}\) may not be integers. Following Chen, Wakefield, and Lumely (2014) a generalised binomial distribution \(y_i^\star \sim \text{xBin}(m_i^\star, \rho_i)\) was used, with working likelihood for \(m^\star_i \geq y^\star_i\) given by \[\begin{equation} p(y_i^\star \, | \, m_i^\star, \rho_i) = \frac{\Gamma(m_i^\star + 1)}{\Gamma(y_i^\star + 1) \Gamma(m_i^\star - y_i^\star + 1)} \rho_i ^{y_i^\star} (1 - \rho_i)^{(m_i^\star - y_i^\star)}. \tag{4.19} \end{equation}\]
4.4.3 Model comparison
Each model was assessed using (Figure 4.11):
- a regular leave-one-out cross-validation (LOO-CV);
- a spatial leave-one-out cross-validation (SLOO-CV).
At each fold the CRPS, MSE and quantile (as in Section 4.3.3) of posterior predictive samples as compared with the observed data were computed. In this section, the number of samples per posterior marginal was \(S = 1000\).
4.4.4 Results
PHIA survey | Continuous ranked probability score (units: 1/1000) | ||||||
---|---|---|---|---|---|---|---|
IID | Besag | BYM2 | FCK | CK | FIK | IK | |
LOO | |||||||
Côte d’Ivoire, 2017 | 6.6 | 6.6 | 6.7 | 6.7 | 6.9 | 6.9 | 6.9 |
Malawi, 2016 | 31.7 | 19.5 | 19.6 | 22.7 | 22.8 | 21.4 | 21.0 |
Tanzania, 2017 | 14.9 | 12.1 | 13.4 | 10.7 | 9.5 | 10.3 | 10.6 |
Zimbabwe, 2016 | 28.9 | 20.8 | 20.9 | 21.7 | 21.6 | 21.4 | 22.0 |
SLOO | |||||||
Côte d’Ivoire, 2017 | 6.5 | 6.6 | 6.6 | 6.4 | 6.9 | 6.4 | 6.8 |
Malawi, 2016 | 31.6 | 19.3 | 19.9 | 26.5 | 29.0 | 25.1 | 28.3 |
Tanzania, 2017 | 14.9 | 12.1 | 18.1 | 16.0 | 17.6 | 15.4 | 16.9 |
Zimbabwe, 2016 | 29.1 | 20.8 | 25.2 | 26.7 | 26.2 | 26.1 | 26.3 |
The results (Figure 4.12, Table 4.4, Table A.4) for each survey were as follows:
For the 2017 PHIA survey in Côte d’Ivoire, all of the models performed similarly, using both LOO- and SLOO-CV (Figure A.37). The pointwise CRPS for all models was high at one outlying district in the survey, Grand-Ponts. It is difficult to see how any spatial random model would perform well in this situation, without additional covariates or using a distribution with heavier tails than the Gaussian.
The CK and IK models had lengthscale posterior distributions largely unchanged from their prior distribution (Figure A.31). This uncertainty in lengthscale resulted in wide prevalence 95% credible intervals for the CK and IK models in Figure A.33. This example shows the importance of being careful using kernel models, and the prior distributions set on their hyperparameters. It is surprisingly that this behaviour appears not to have resulted in poor LOO or SLOO performance.
For this survey the BYM2 proportion posterior distribution was also similar to its prior distribution, in contrast to each of the other surveys which had BYM2 proportion posteriors peaked at one, corresponding to spatially structured noise (Figure A.32).
For the 2017 PHIA survey in Malawi the Besag and BYM2 models performed the best, followed by the kernel models, and then the IID model (Figure A.38). While the LOO and SLOO CRPS values for IID, Besag and BYM2 models were similar, for the kernel models forecasting performance was substantially reduced by leaving out adjacent districts. This finding is surprising, as the kernel models make use of more distant correlations, and it is the adjacency-based models that one would intuitively expect to be hampered more by the SLOO-CV. For the IID model, that LOO and SLOO performance are similar is no surprise as in all cases the IID model should be predicting the mean. Though less data are available in the SLOO case, this should be of little consequence.
For the 2017 PHIA survey in Tanzania (Figure A.39), under LOO-CV the kernel models performed better, but under SLOO-CV there was a significant drop in performance.
Finally, for the 2016 PHIA survey in Zimbabwe, performance for each of the spatially structured models was similar (Figure A.40). Again, under SLOO-CV, performance of the BYM2 and kernel-based models dropped. Differences within the kernel-based models for this survey, and indeed across all four surveys, were limited.
4.5 Discussion
4.5.1 Modelling
Though there are situations where other models perform better, on the whole this study supports the use of adjacency-based spatial random effect models. For the study on HIV survey data, adjacency-based models performed well, if not the best, in all cases. That is not to say that under data truly generated from a kernel model, there isn’t significant benefit to using the corresponding kernel model for inference. However, the transferability of this finding to applied settings is limited by the following factors. First and foremost, it is usually impossible to know that real data was generated from any particular process. Second, the synthetic data study used the same kernel, Matérn with \(\nu = 3/2\) (Equation (4.14)), for both simulation and inference, and as such represents a best-case. Third, specification of the lengthscale prior distribution is challenging, and easy to do badly. Finally, aggregation via the integrated kernel occurred at the level of the latent field, despite the fact that most of the time we expect aggregation to occur at the level of the data. If the link function \(g\) is the identity or linear then the two are equivalent, but non-linear link functions create a discrepancy, which this study did not address.
This chapter did not consider use of the stochastic partial differential equation (SPDE) approximation of Lindgren, Rue, and Lindström (2011) as a potentially more computationally efficient way to implement integrated kernel models (Wilson and Wakefield 2018). Though the underlying models are ultimately similar, that is a continuous Matérn random field over space aggregated at an area-level, the findings from this work are likely to apply to use of the SPDE approximation. Nonetheless, it would be of value to confirm this empirically.
This chapter used area-level models to for data which arises by aggregation of point-level data. However, Konstantinoudis et al. (2020) found that using a point-level LGCP model rather than an area-level BYM model may have significant benefits. The work in this chapter does not address the broader question of under which circumstances use of an area or point-level model is sensible.
The adjacency-based models considered in this study were limited to the Besag and BYM2 model. Although these are perhaps the most widely used adjacency-based models, others could have been considered. Examples include the more general weighted ICAR model discussed in Section 4.1.4. Additionally, it would be of interest to implement the integrated kernel model with population-based weighting (Section 4.2.2.1).
The models used for spatial structure in this chapter were all stationary. Although stationarity assumptions may be violated by HIV survey data, it remains challenging to estimate non-stationary spatial structure (Christopher J. Paciorek and Schervish 2006).
4.5.2 Model comparison
Previous spatial random effect comparison studies (Nicky Best, Richardson, and Thomson 2005; Lee 2011) were limited to the DIC measure of model performance. Use of the DIC is strongly discouraged by Vehtari, Gelman, and Gabry (2017). This study used less flawed measures of model performance, such as the cross-validated CRPS. It would be beneficial to compute the DIC and WAIC in Section 4.4 as a comparison. Additionally, the measures used in this study were computed and presented by individual area. With refinements to the sample sizes used, these area-specific measures of performance could enable more nuanced conclusions about the use of spatial random effect models.
Cross-validation was performed using \(\rho\) as the forecasting target, rather than \(y\) as is typical. This decision was made because applied interest is in forecasting HIV prevalence at a district level, not forecasting the outcome of a household survey. It could be argued that a district does not become more important to forecast well by virtue of surveying a larger sample size in that district. That said, an alternative viewpoint is that forecast accuracy should be incentivised in proportion to district population size, such that PLHIV is accurately estimated. If sample size is proportional to population size, then forecasting \(y\) could be a useful proxy. Choice of the particular parameter, or transformation of that parameter (Nikos I. Bosse et al. 2023), to score is an ongoing topic of research.
The CRPS was used in preference to the log-score. Whereas the log-score requires a kernel density estimate of the posterior distribution, and is therefore sensitive to tuning parameters, the CRPS can be estimated from samples alone. A downside of use of the CRPS and MSE is their relative lack of interpretability. For example, it is difficult to determine whether a forecast is good, or suitable for practical use, on the basis of its CRPS or MSE. Measures such as the skill score have been used to contrast forecast performance with some baseline. A constant model, with no random effects, could be used as such a baseline.
4.5.3 Inference
A strength of this work is that all of the inferential models (Table 4.2) in this chapter were implemented in TMB
.
Inference was then conducted using AGHQ over the marginal Laplace approximation using the aghq
package.
The accuracy of inferences was compared to gold-standard results from NUTS obtained using the tmbstan
package.
An earlier version of this study used R-INLA
.
Not all of the inferential models were compatible with R-INLA
, so rstan
was used in some cases.
However, due to the difference in inference algorithm, this study design conflated statistical models with inference algorithms.
Consistent use of TMB
, a fast and flexible tool for spatial modelling (Osgood-Zimmerman and Wakefield 2023), overcame this limitation.
Chapter 6 extends TMB
to implement the INLA algorithm of R-INLA
.