knitr::opts_chunk$set(
  echo = TRUE,
  cache = TRUE,
  autodep = TRUE,
  cache.comments = FALSE
)

library(tidyverse)
library(TMB)
library(INLA)
library(rmarkdown)

midblue <- "#3D9BD0"
midgreen <- "#00855A"
midpink <- "#B3608E"

Background

Consider a three-stage model \[\begin{alignat}{2} &\text{(Observations)} & y_i &\sim p(y_i \, | \, x_i, \theta), \quad i \in \mathcal{I}, \\ &\text{(Latent field)} & x &\sim \mathcal{N}(x \, | \, 0, Q(\theta)^{-1}), \\ &\text{(Hyperparameters)} & \qquad \theta &\sim p(\theta), \label{eq:hyper} \end{alignat}\] with posterior distribution \[\begin{equation} p(x, \theta \, | \, y) \propto p(y, x, \theta) = p(y \, | \, x, \theta) p(x \, | \, \theta) p(\theta). \end{equation}\]

After the negative log-target has been written in a C++ file, model.cpp, it can be compiled and loaded into R:

compile("model.cpp")
dyn.load(dynlib("model"))

Let \(f(x, \theta) \equiv - \log p(y, x, \theta)\), where \(y\) is assumed to be fixed. This objective (f$fn) and its gradient (f$gr) can be created in TMB using:

f <- MakeADFun(data = dat, parameters = param, DLL = "model")

There are four stages to replicating the INLA method.

Stage 1: get \(\tilde p(\theta \, | \, y)\)

Approximate the posterior marginal \(p(\theta \, | \, y) \propto L(\theta)\) using the Laplace approximation \(L(\theta) \approx L^\star(\theta)\): \[\begin{equation} \tilde p(\theta \, | \, y) \propto \frac{p(y \, | \, x, \theta) p(x \, | \, \theta) p(\theta)}{\tilde p_G(x \, | \, \theta, y)} \Big\rvert_{x = \mu(\theta)} \equiv L^\star(\theta) \end{equation}\]

Using TMB, if we integrate out a Gaussian approximation to the latent field \(x\) using the argument random = "x" in MakeADFun, then the result is the negative log of the Laplace approximation to the marginal likelihood, \(h(\theta) \equiv - \log L^\star(\theta)\):

h <- MakeADFun(data = dat, parameters = param, DLL = "model", random = "x")

Therefore, exp(-h$fn(theta)) is proportional to \(\tilde p(\theta \, | \, y)\).

Stage 2: choose \(\{\theta_k, \Delta_k\}_{k = 1}^K\)

Choose integration points \(\{\theta_k\}\) and weights \(\{\Delta_k\}\) such that \(\sum_{k = 1}^K \Delta_k = 1\).

Empirical Bayes

\(K = 1\) with integration point \(\hat \theta = \text{argmin}_{\theta} (-\log L^\star(\theta))\). This can be found by minimising h using optimisers like nlminb or optim. Methods like this are sometimes referred to as “Empirical Bayes”, since the hyperparameters are fixed at a point estimate rather than integrated over.
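
A minimal sketch, using the h defined above:

# Empirical Bayes: minimise h to obtain a point estimate of the hyperparameters
opt <- nlminb(start = h$par, objective = h$fn, gradient = h$gr)
theta_hat <- opt$par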

R-INLA grid method

Instead of just taking a single point, we might want to explore the posterior approximation to \(p(\theta \, | \, y)\) more fully. R-INLA offers a few choices for how to do this, including the grid integration strategy, which proceeds as follows:

  1. Locate the mode of \(\log \tilde p(\theta \, | \, y)\), which is the same as \(\text{argmin}_{\theta} (-\log L^\star(\theta))\) and can be calculated as above.
  2. Calculate the negative Hessian at the mode (numerically, e.g. with optimHess applied to h$fn; TMB also exposes h$env$spHess, though that is the Hessian with respect to the latent field).
  3. Reparameterise in terms of \(z\) coordinates along the eigenvectors of the Hessian, starting at the mode.
  4. Test points at a certain step-size (\(\delta_z\)) against a condition whereby they must have sufficient posterior density (using a parameter \(\delta_{\pi}\)).

Under this approach, all weights would be equal \(\Delta_k = 1 / K\). For a more detailed run through of the grid integration method, see the INLA two dimensional grid replication notebook.
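
A rough sketch of steps 2–4, where theta_hat (the mode), H (the negative Hessian at the mode), nl_post (the function \(-\log \tilde p(\theta \, | \, y)\)) and delta_pi are hypothetical names:

# z-parametrisation: theta(z) = theta_hat + V (z / sqrt(lambda)), where
# H = V diag(lambda) V' is the eigendecomposition of the negative Hessian
eig <- eigen(H)
theta_z <- function(z) as.vector(theta_hat + eig$vectors %*% (z / sqrt(eig$values)))

# A point is kept if its posterior density is sufficiently close to the mode's
keep <- function(z) {
  nl_post(theta_z(z)) - nl_post(theta_hat) < delta_pi
}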

Adaptive Gauss-Hermite quadrature

We could also use an adaptive Gauss-Hermite quadrature (AGHQ) rule. See the Implementing Approximate Bayesian Inference using Adaptive Quadrature: the aghq Package paper. AGHQ is attractive because it is automatic and works well for integrating densities that are close to Gaussian.
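
A minimal sketch, assuming the aghq package is available; its aghq function expects the log posterior, whereas h$fn is its negative:

ff <- list(
  fn = function(theta) -h$fn(theta),                        # log posterior
  gr = function(theta) -h$gr(theta),                        # gradient
  he = function(theta) -stats::optimHess(theta, h$fn, h$gr) # Hessian
)
quad <- aghq::aghq(ff, k = 3, startingvalue = 0) # k nodes per dimension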

Stage 3: get \(\tilde p(x_i \, | \, \theta, y)\)

Gaussian

The simplest approach would be to directly obtain the marginals of \(\tilde p_{\text{G}}(x \, | \, \theta, y)\) from above.
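
A minimal sketch with TMB, assuming that the par.random and diag.cov.random slots of sdreport contain the conditional mode and marginal variances of the random effects:

sd_out <- sdreport(h)
x_mean <- sd_out$par.random          # mode of the Gaussian approximation
x_sd <- sqrt(sd_out$diag.cov.random) # marginal standard deviations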

Laplace

Compute the Laplace approximation \(\tilde p_{\text{LA}}(x_i \, | \, \theta, y)\) for each \(i\), treating \(x_i\) as fixed and integrating out the remaining latent field \(x_{-i}\). This is more accurate than the Gaussian marginal, but requires one inner optimisation per latent variable.

Simplified Laplace

This is the key innovation in the 2009 INLA paper: a series expansion of \(\tilde p_{\text{LA}}(x_i \, | \, \theta, y)\) about the mean corrects the Gaussian marginal for location and skewness, at much lower cost than the full Laplace approximation.

Stage 4: apply quadrature

Combine the Laplace approximation to the hyperparameter posterior from Stage 1, the quadrature nodes and weights from Stage 2, and the chosen posterior latent field marginal approximation from Stage 3 together in one equation: \[\begin{equation} \tilde p(x_i \, | \, y) = \sum_{k = 1}^K \tilde p(x_i \, | \, \theta^{(k)}, y) \times \tilde p(\theta^{(k)} \, | \, y) \times \Delta^{(k)} \end{equation}\]
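
As a sketch, if marg_x is a matrix with entries marg_x[i, k] \(= \tilde p(x_i \, | \, \theta^{(k)}, y)\) over a grid of \(x_i\) values, and post_theta and weights are vectors over \(k\) (all hypothetical names), then the sum is a single matrix-vector product:

# Sum over quadrature nodes k for every latent value x_i at once
post_x <- as.vector(marg_x %*% (post_theta * weights))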

Example 1: Blangiardo

This is the example from the book Spatial and Spatio-temporal Bayesian Models with R-INLA. Consider the following simple model for i.i.d. observations \(y = (y_1, \ldots, y_n)\) \[\begin{align} y_i \, | \, x, \theta &\sim \mathcal{N}(x, 1/\theta), \\ x &\sim \mathcal{N}(x_0 = -3, 1/\tau_0 = 4), \\ \theta &\sim \text{Gamma}(a = 1.6, b = 0.4). \end{align}\]

We start by loading the data manually:

y <- c(
  1.2697, 7.7637, 2.2532, 3.4557, 4.1776, 6.4320, -3.6623, 7.7567, 5.9032, 7.2671,
  -2.3447, 8.0160, 3.5013, 2.8495, 0.6467, 3.2371, 5.8573, -3.3749, 4.1507, 4.3092,
  11.7327, 2.6174, 9.4942, -2.7639, -1.5859, 3.6986, 2.4544, -0.3294, 0.2329, 5.2846
)

n <- length(y)
y_bar <- mean(y)

The observations (ordered by index, though the ordering has no interpretation) and the prior (mean shown as a dashed line, notably different from the data):

data.frame(index = 1:30, y = y) %>%
  ggplot(aes(x = index, y = y)) +
  geom_point() +
  geom_hline(yintercept = -3, col = "#666666", lty = "dashed") +
  theme_minimal()

We use the following parameters of the prior distribution:

x_0 <- -3
tau_0 <- 1/4
a <- 1.6
b <- 0.4

Implementation using R-INLA

This model is simple to fit using the R-INLA software directly:

formula <- y ~ 1
dat <- list(y = y)

theta_prior <- list(prec = list(prior = "loggamma", param = c(a, b)))

fit <- inla(
  formula,
  data = dat,
  control.family = list(hyper = theta_prior),
  control.fixed = list(mean.intercept = x_0, prec.intercept = tau_0)
)

Implementation using R

By conjugacy, the full conditional \(p(x \, | \, \theta, y)\) is known in closed form:

# Exact distribution of x given theta
inner_loop <- function(theta) {
  tau_n <- n * theta + tau_0
  x_n <- (theta * n * y_bar + tau_0 * x_0) / tau_n
  return(list(x_n = x_n, tau_n = tau_n))
}

Functions for the negative log full conditional of \(x\) and the negative log posterior marginal of \(\theta\), accepting either \(\theta > 0\) directly or \(\log(\theta)\) (which avoids the positivity constraint):

nl_full_conditional_x <- function(x, theta, log_input = FALSE) {
  if(log_input == TRUE) theta <- exp(theta) # Convert from log(theta) to theta
  par <- inner_loop(theta)
  return(-dnorm(x, par$x_n, sqrt(1 / par$tau_n), log = TRUE))
}

nl_post_marginal_theta <- function(theta, log_input = FALSE) {
  target <- 0
  if(log_input == TRUE) {
    target <- target + theta # Increment by Jacobian correction (theta is l_theta here)
    theta <- exp(theta) # Convert from log(theta) to theta
  }
  par <- inner_loop(theta)

  # (^)
  target <- target + -0.5 * log(par$tau_n) +
    dgamma(theta, shape = a, rate = b, log = TRUE) +  # theta prior
    dnorm(par$x_n, x_0, sqrt(1 / tau_0), log = TRUE) +  # x prior
    sum(dnorm(y, par$x_n, sqrt(1 / theta), log = TRUE)) # y likelihood

  return(-target)
}

Note that the code above in the section marked (^) is very similar to that in blangiardo.cpp. Recall that the Laplace approximation \(\tilde p(\theta \, | \, y)\), which in this instance coincides with the exact posterior, is given by \[\begin{equation} \tilde p(\theta \, | \, y) \propto \frac{p(y, \mu(\theta), \theta)}{\det(Q(\theta))^{1/2}}. \end{equation}\] This form can be seen in (^):

  • \(p(y, \mu(\theta), \theta)\) corresponds to the calls to dgamma, dnorm and sum of dnorm evaluated at \(\mu(\theta)\) which here is simply par$x_n (for the Normal distribution the mean of the distribution is the same as the mode).
  • The logarithm of \(1 / \det(Q(\theta))^{1/2}\) when \(\theta\) is a scalar is given by -0.5 * log(par$tau_n).

It should be the case that inner_loop produces the latent field values found by TMB's inner optimisation loop (hence the name); this is verified in the unit tests at the end of this notebook.

de_nl <- function(f, ...) exp(-f(...)) # Versions which are not negative logarithms

full_conditional_x <- function(x, theta, log_input = FALSE) {
  de_nl(nl_full_conditional_x, x, theta, log_input = log_input) # Forward log_input
}

post_marginal_theta <- function(theta, log_input = FALSE) {
  de_nl(nl_post_marginal_theta, theta, log_input = log_input) # Forward log_input
}

Simple grids (one dimensional):

trapezoid_rule <- function(x, spacing) {
  # Interior points count twice, the two endpoints once
  0.5 * spacing * (2 * sum(x) - x[1] - x[length(x)])
}

eval_grid <- function(grid = NULL, spacing = NULL, uniform = FALSE, K = NULL, min = NULL, max = NULL, f) {
  if(uniform) {
    grid <- seq(min, max, length.out = K)
    spacing <- (max - min) / (K - 1) # seq with length.out = K has K - 1 gaps
  }
  df <- data.frame(input = grid, output = sapply(grid, f))
  df <- mutate(df, norm_output = output / trapezoid_rule(output, spacing))
  return(df)
}

blangiardo_theta <- eval_grid(
  uniform = TRUE,
  K = 25, min = 0.001, max = 0.3,
  f = post_marginal_theta
)

dense_theta <- eval_grid(
  uniform = TRUE,
  K = 500, min = 0.001, max = 0.3,
  f = post_marginal_theta
)
ggplot(dense_theta) +
  geom_line(aes(x = input, y = output), col = midblue) +
  geom_point(data = blangiardo_theta, aes(x = input, y = output), shape = 4) +
  theme_minimal() +
  theme(axis.text.y = element_blank()) +
  labs(title = "Blangiardo integration points for the hyperparameters")
True posterior marginal of theta (blue line) overlaid with the choice of integration points. This naive grid places too many points in regions of the parameter space without much posterior density.

Optimisation on the log scale (to avoid constrained optimisation):

its <- 1000
r_nlminb <- nlminb(start = 0,
                   objective = nl_post_marginal_theta,
                   log_input = TRUE,
                   control = list(iter.max = its, trace = 0))

r_optim <- optim(par = 0,
                 fn = nl_post_marginal_theta,
                 log_input = TRUE,
                 method = "Brent",
                 lower = -100, # Brent requires finite lower and upper
                 upper = 100,  # bounds for one-dimensional optimisation
                 control = list(maxit = its, trace = 0))

c(r_nlminb$par, r_optim$par) # The same
## [1] -2.640129 -2.640129
r_opt <- r_nlminb

For INLA’s grid strategy, starting from the mode, we take steps of size \(\delta_z\) checking that each point meets the criteria \[\begin{equation} \log \tilde p(\theta(0) \, | \, y) - \log \tilde p(\theta(z) \, | \, y) < \delta_\pi. \label{eq:criteria} \end{equation}\] In the one dimensional case there is no need to do the \(z\)-parametrisation. Choosing \(\delta_z\) and \(\delta_\pi\) based upon manual tuning1:

delta_z <- 0.05
delta_pi <- 3

The following is not especially efficient R programming, but illustrates how it could be done:

# Walk away from the mode in steps of delta_z, keeping each proposal while it
# meets the posterior density criterion
walk <- function(direction) {
  proposals <- c()
  i <- 1
  repeat {
    proposal <- r_opt$par + direction * i * delta_z
    statistic <- nl_post_marginal_theta(theta = proposal, log_input = TRUE) -
                 nl_post_marginal_theta(theta = r_opt$par, log_input = TRUE)
    if(statistic >= delta_pi) break
    proposals <- c(proposals, proposal)
    i <- i + 1
  }
  proposals
}

# The mode plus the kept points in each direction, sorted (on the log theta scale)
points <- sort(c(walk(-1), r_opt$par, walk(1)))

inla_theta <- eval_grid(exp(points), f = post_marginal_theta, spacing = delta_z)

ggplot(dense_theta) +
  geom_line(aes(x = input, y = output), col = midblue) +
  geom_point(data = inla_theta, aes(x = input, y = output), shape = 4) +
  theme_minimal() +
  theme(axis.text.y = element_blank()) +
  labs(title = "INLA integration points for the hyperparameters") +
  annotate(
    "text", x = 0.175, y = 5e-41,
    label = "Now the points are concentrated \n where there is higher density"
  )
The INLA method grid points. This looks better, though still a little strange, probably because of the reparametrisation. There is also the task of choosing the values of \(\delta_z\) and \(\delta_\pi\).

One of the INLA vignettes discusses how the user can set their own integration points – see browseVignettes(package = "INLA"). The integration points that R-INLA uses can be found using fit$joint.hyper:

internals_inla_theta <- data.frame(
  input = exp(fit$joint.hyper$`Log precision for the Gaussian observations`),
  output = exp(fit$joint.hyper$`Log posterior density`)
)

internals_inla_theta$input
##  [1] 0.12662611 0.07130574 0.04015372 0.05888323 0.03315835 0.08634902 0.02738167 0.10456596 0.15334025 0.04862490

We take the set of points defined by inla_theta to be our \(\{\theta^{(k)}\}\). \(K\) is given by length(points) which equals 26.

Define a function nl_joint_post by taking the negative log of the joint posterior \(p(x, \theta \, | \, y) = p(x \, | \, \theta, y) p(\theta \, | \, y)\) to give \[\begin{equation} - \log p(x, \theta \, | \, y) = - \log p(x \, | \, \theta, y) - \log p(\theta \, | \, y). \end{equation}\]

nl_joint_post <- function(x, theta, log_input = FALSE) {
  nl_full_conditional_x(x, theta, log_input) +
  nl_post_marginal_theta(theta, log_input)
}

For any given input \(x\) you can do quadrature according to \[\begin{equation} \tilde p(x \, | \, y) = \sum_{k = 1}^K p(x \, | \, \theta^{(k)}, y) \times p(\theta^{(k)} \, | \, y) \times \Delta^{(k)}, \end{equation}\] but how should the range of \(x\) be chosen? Blangiardo and Cameletti use a naive grid for demonstration:

x_start <- -8
x_end <- 5
x_out <- 50
blangiardo_x <- seq(x_start, x_end, length.out = x_out)

Modify nl_joint_post to accept vector arguments using the base R function Vectorize, then evaluate it over the product grid with outer, storing the result in a matrix nl_joint_post_mat (rather than overwriting the function):

v_nl_joint_post <- Vectorize(nl_joint_post, vectorize.args = c("x", "theta"))
nl_joint_post_mat <- outer(blangiardo_x, points, v_nl_joint_post, log_input = TRUE)

50 rows (the number of \(x\) integration points) and 26 columns (the number of \(\theta\) integration points):

dim(nl_joint_post_mat)
## [1] 50 26

The \(x\) and \(y\)-axis do not correspond to values here, just indices:

reshape2::melt(nl_joint_post_mat) %>%
  ggplot(aes(x = Var1, y = Var2, fill = value)) + 
    geom_tile() +
    theme_minimal() +
    labs(fill = "", x = "x", y = "theta") +
    theme(legend.position = "none")

Calculating the log posterior marginal of \(x\) by integrating over the hyperparameters for each value of \(x\). Note that the quadrature sums densities rather than log densities, so we use a log-sum-exp for numerical stability:

K <- length(points)

# Quadrature with equal weights 1/K: average the joint density over the theta
# points, computed on the log scale via log-sum-exp
log_sum_exp <- function(l) max(l) + log(sum(exp(l - max(l))))
nl_post_marginal_x <- log(K) - apply(-nl_joint_post_mat, 1, log_sum_exp)

df <- data.frame(
  x = blangiardo_x,
  log_post = -1 * nl_post_marginal_x
)

ggplot(df, aes(x = x, y = log_post)) +
  geom_line() +
  theme_minimal() +
  labs(x = "x", y = "log(Posterior)")

Try normalising:

x_spacing <- (x_end - x_start) / (x_out - 1) # seq with length.out = x_out has x_out - 1 gaps
safe_exp <- function(x) exp(x - max(x))
df$post <- safe_exp(df$log_post)
norm <- trapezoid_rule(df$post, x_spacing)
df$norm_post <- df$post / norm

ggplot(df, aes(x = x, y = norm_post)) +
  geom_line() +
  theme_minimal() +
  labs(x = "x", y = "Posterior")

Implementation in TMB

C++ for the negative log joint posterior is given by:

// blangiardo.cpp

#include <TMB.hpp>

template <class Type>
Type objective_function<Type>::operator()()
{
  // Data block
  DATA_VECTOR(y);

  // Parameter block
  PARAMETER(x);
  PARAMETER(l_theta);

  // Transformed parameters block
  Type sigma = sqrt(1 / exp(l_theta));

  Type nll;
  nll = Type(0.0);

  // Priors
  nll -= dlgamma(l_theta, Type(1.6), Type(1 / 0.4), true);
  nll -= dnorm(x, Type(-3.0), Type(2.0), true);

  // Likelihood
  nll -= dnorm(y, x, sigma, true).sum();

  // ADREPORT
  ADREPORT(x);
  ADREPORT(sigma);

  return(nll);
}
TMB::compile("blangiardo.cpp")
dyn.load(dynlib("blangiardo"))

param <- list(x = 0, l_theta = 0)

h with Laplace approximation is a function of \(\dim(\theta)\) inputs, integrating out \(x\) by specifying random = c("x"):

h <- MakeADFun(
  data = dat,
  parameters = param,
  random = c("x"),
  DLL = "blangiardo",
  hessian = TRUE,
  silent = TRUE
)

f without Laplace approximation is a function of \(\dim(\theta) + \dim(x)\) inputs:

f <- MakeADFun(
  data = dat,
  parameters = param,
  DLL = "blangiardo",
  hessian = TRUE,
  silent = TRUE
)

Optimisation using nlminb and optim gets the same results:

its <- 1000

tmb_nlminb <- nlminb(
  start = h$par,
  objective = h$fn,
  gradient = h$gr,
  control = list(iter.max = its, trace = 0)
)

tmb_optim <- optim(
  par = h$par,
  fn = h$fn,
  gr = h$gr,
  method = "Brent",
  lower = -100, # Brent requires finite lower and upper
  upper = 100,  # bounds for one-dimensional optimisation
  control = list(maxit = its, trace = 0)
)

c(tmb_optim$par, tmb_nlminb$par) # The same
##             l_theta 
## -2.640129 -2.640129
tmb_opt <- tmb_nlminb

sd_out <- sdreport(h, par.fixed = tmb_opt$par, getJointPrecision = TRUE)

Comparison of the posterior mode and standard deviation of \(\log(\theta)\):

#' For some reason the mode in the INLA result below is NA, how to fix this?
#' According to Håvard Rue, this is a feature not a bug, and he hasn't found a good way to compute the mode yet:
#' https://groups.google.com/g/r-inla-discussion-group/c/mbSsNO_bydo

kable_data <- data.frame(
  "Mode" = c(NA, r_opt$par, tmb_opt$par),
  "Variance" = c(fit$internal.summary.hyperpar$sd^2, NA, sd_out$cov.fixed) # sd_out is from TMB
)

rownames(kable_data) <- c("$\\texttt{R-INLA}$", "R", "$\\texttt{TMB}$")

kableExtra::kable(kable_data, booktabs = TRUE, escape = FALSE, align = "c")
                     Mode        Variance
\(\texttt{R-INLA}\)  NA          0.0665250
R                    -2.640129   NA
\(\texttt{TMB}\)     -2.640129   0.0651227

What about for \(x\):

sd_out$par.random # TMB
##        x 
## 2.679973
fit$summary.fixed # INLA
plot(fit$marginals.fixed$`(Intercept)`, type = "l")
abline(v = sd_out$par.random)

We can make a version of h$fn on the same scale as post_marginal_theta for a plot:

tmb_post_marginal_theta <- function(theta) {
  de_nl(h$fn, x = log(theta))
} # (Have tested that this function does as it should)

tmb_dense_theta <- eval_grid(
  uniform = TRUE,
  K = 500, min = 0.001, max = 0.3, 
  f = tmb_post_marginal_theta
)
# Normalised dense_theta and tmb_dense_theta since they include different constants
ggplot(dense_theta) +
  geom_line(aes(x = input, y = norm_output), col = midblue) +
  geom_line(data = tmb_dense_theta, aes(x = input, y = norm_output), col = midpink) +
  theme_minimal() +
  theme(axis.text.y = element_blank()) +
  annotate("text", x = 0.1, y = 0.01, label = "TMB", col = midpink) +
  annotate("text", x = 0.045, y = 0.012, label = "R", col = midblue)
At least this verifies that the optimisation is sensible.

Unit tests

That h$fn is the same as in R

Testing to see that h$fn is the same as nl_post_marginal_theta with log_input = TRUE up to a constant. The constant should be \(\log \sqrt{2 \pi} \approx 0.919\), since the code marked (^) omits the \(\sqrt{2 \pi}\) factor from the Gaussian integral in the Laplace approximation:

vals <- seq(-5, 0, by = 1)

tmb_vals <- sapply(vals, h$fn) # Replications show it not to be stochastic
r_vals <- sapply(vals, nl_post_marginal_theta, log_input = TRUE)
rbind(tmb_vals, r_vals, tmb_vals - r_vals)
##                 [,1]        [,2]       [,3]       [,4]        [,5]        [,6]
## tmb_vals 115.9473778 103.3850776 94.9432186 98.0263651 133.1950528 256.0811622
## r_vals   116.8663164 104.3040161 95.8621572 98.9453036 134.1139914 257.0001008
##           -0.9189385  -0.9189385 -0.9189385 -0.9189385  -0.9189385  -0.9189385

That the inner optimisation is the same

get_tmb_inner_optimised_value <- function(l_theta) {
  sd_out <- sdreport(h, par.fixed = l_theta)
  return(sd_out$par.random) # The inner (conditional) mode of x from TMB
}

tmb_inner <- sapply(vals, get_tmb_inner_optimised_value)
r_inner <- sapply(exp(vals), function(theta) inner_loop(theta)$x_n)
rbind(tmb_inner, r_inner)
##                    x        x        x        x        x        x
## tmb_inner -0.1640675 1.359735 2.433834 2.975408 3.202838 3.290922
## r_inner   -0.1640675 1.359735 2.433834 2.975408 3.202838 3.290922

Looks like this part is correct.

Example 2: Unknown


  1. It would be good to know how INLA selects these numbers in general↩︎