Abstract
Background Detecting exponential growth is a central goal of the Nucleic Acid Observatory (NAO). Regression using a Poisson generalised linear model (GLM) is possibly the simplest statistical approach to detecting exponential growth.
Task In this notebook, originally by Mike and extended by Adam, we use simple simulations and mathematical analysis to further our understanding of exponential growth detection.
Findings
Next steps
knitr::opts_chunk$set(
echo = TRUE,
cache = TRUE,
autodep = TRUE,
cache.comments = FALSE
)
options(scipen = 999)
cbpalette <- multi.utils::cbpalette()
library(tidyverse)
# ggplot helpers
# library(ggbeeswarm)
library(cowplot)
library(patchwork)
theme_set(theme_minimal())
# stats helpers
# library(broom)
Suppose that there is a local epidemic that is growing exponentially at some rate \(r\), so that the total number of cases at time \(t = 1, \ldots, T\) is \[ n_t = n_0 e^{rt}. \]
Meanwhile, cases are detected sporadically, in a manner that we can approximate by a Poisson process with intensity \[ \lambda_t = c_t n_t, \] where \(c_t\) is the detection rate at time \(t\). For now we will assume that \(c_t = c\) is constant. Another way to interpret \(c\) is as the rate at which infected people travel through an international airport monitored by a sentinel site, assuming that all infections passing through are detected.
Let \(x_t\) be the number of detections at time \(t\) and \(X_t\) be the cumulative number of detections. The number of detections will most clearly appear to be growing exponentially once \(\lambda_t \gg 1\). However, perhaps we can infer positive exponential growth long before this point using an appropriate statistical analysis that accounts for the sporadic (Poisson) nature of detections.
r <- 0.1
c <- 1e-4
T_days <- 125
set.seed(42)
sim <- tibble(t = seq(1, T_days)) %>%
mutate(
n = exp(r * t),
lambda = c * n,
x = rpois(n(), lambda),
X = cumsum(x)
)
We start by simulating exponentially increasing data with \(r\) set to 0.1 and \(c\) set to 0.0001. Note: in this simulation the value of \(c\) doesn’t change the dynamics other than to rescale \(\lambda_t\) (equivalently, to shift the curve in time), so we don’t need to test different values of \(c\) here; the same goes for \(n_0\). The number of daily detections looks as follows:
p1 <- sim %>%
ggplot(aes(t, x)) +
geom_point(alpha = 0.5) +
geom_line(aes(y = lambda), color = cbpalette[1]) +
labs(x = "Day", y = "Number of detected cases")
p1
Note that the period of time where \(\lambda_t \approx 1\) is very short. This is in some sense the nature of exponential growth, though if we instead had a much smaller \(r\), this period would be longer. It might be helpful to look at the data on the log scale:
p2 <- sim %>%
ggplot(aes(t, log(x))) +
geom_point(alpha = 0.5) +
geom_line(aes(y = log(lambda)), color = cbpalette[1]) +
labs(x = "Day", y = "log(Number of detected cases)")
p2
Taking this view allows us to more clearly see the challenge presented by Poisson observations. The logarithm of the latent Poisson intensity is increasing linearly, yet the logarithm of the number of cases only starts to look linear at around the hundredth day.
Let’s fit the data using a Poisson regression model of the form \[ x_t \sim \text{Poisson}(\lambda_t), \\ \log(\lambda_t) = \beta_0 + \beta t, \] where \(\beta_0\) is the intercept and \(\beta\) is the slope of the linear predictor. This model is specified in glm using the formula x ~ 1 + t as follows:
fit <- glm(x ~ 1 + t, family = poisson, data = sim)
coef(fit)
## (Intercept) t
## -10.0816799 0.1080257
It isn’t difficult to recover the parameters that we generated the data with. The estimated intercept -10.0816799 is reasonably close to \(\log(\lambda_1) = \log(c n_1)\), which is -9.1103404; the gap arises because the slightly overestimated slope is compensated for by a lower intercept. The estimated slope 0.1080257 very closely matches the true exponential growth rate 0.1.
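As a quick check (a small addition using the fit object above), we can pull out confidence intervals for both coefficients; the row for t is the estimated exponential growth rate, to be compared against the true value of 0.1:
# Coefficient estimates with 95% confidence intervals
broom::tidy(fit, conf.int = TRUE)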
What about if it were day \(s\) and we wanted to know whether cases were increasing exponentially? In that case, we would only have some subset of the data, \(x_1, \ldots, x_s\). We will try fitting the same GLM as before, but just to this subset. Note that it isn’t possible to use glm() before there are any cases.
first_detection <- sim %>% filter(x > 0) %>% slice(1) %>% pull(t)
As such, we will start fitting from the day after the first detection, which happens on day 62.
set.seed(42)
fits <- tibble(today = (first_detection + 1):T_days) %>%
mutate(
fit = map(today, ~glm(x ~ t, family = poisson, data = sim[seq(.x),])),
fit = map(fit, broom::tidy, conf.int = TRUE)
) %>%
unnest(fit)
What does our inference for \(\beta_0\) and \(\beta\) look like? Let’s add an indicator for whether the confidence interval is above zero:
fits <- fits %>%
mutate(above_zero = as.logical((conf.low > 0) & (conf.high > 0)))
p3 <- fits %>%
filter(term == 't', today >= first_detection) %>%
ggplot(aes(today, y = estimate, ymax = conf.high, ymin = conf.low, col = above_zero)) +
geom_hline(yintercept = c(0, r), color = "black", linetype = "dashed") +
geom_pointrange(alpha = 0.5) +
scale_colour_manual(values = cbpalette[c(3, 2)]) +
labs(x = "Day", y = "Estimated slope", col = "CI above zero?")
p3
Interestingly, in this case, upon the first detection the 95% CI for \(\beta\) is already above zero! Soon afterwards, however, the interval reverts to including zero, and only excludes it again a few weeks later.
fits_long <- fits %>%
select(today, term, estimate) %>%
pivot_wider(
names_from = "term",
values_from = "estimate"
) %>%
rename(
"intercept" = "(Intercept)"
)
sim %>%
ggplot(aes(t, log(x))) +
geom_point(alpha = 0.5) +
geom_line(aes(y = log(lambda)), color = cbpalette[1]) +
geom_abline(data = fits_long, aes(intercept = intercept, slope = t, col = today), alpha = 0.2) +
scale_colour_viridis_c(option = "C") +
labs(x = "Day", y = "log(Number of detected cases)", col = "Model fit to data from\nthis day or before")
On the day that the first case is detected, the linear model (on the log scale) is able to (essentially) perfectly fit that single data point by making the intercept very negative. In particular, the fitted regression model on day 63 is
fits %>%
filter(today == first_detection + 1)
This issue could be fixed by a Bayesian model which doesn’t perfectly interpolate the data (it would be regularised using priors).
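A minimal sketch of that idea (my addition, assuming the arm package is installed): arm::bayesglm() places weakly informative priors on the coefficients, so a single early detection no longer pushes the intercept to extreme values.
# Regularised Poisson regression via arm::bayesglm (Cauchy priors on the
# coefficients by default), fit to the data available on day 63
fit_bayes <- arm::bayesglm(x ~ 1 + t, family = poisson, data = sim[seq(first_detection + 1), ])
coef(fit_bayes)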
r <- 1e-2
c <- 1e-4
Let’s consider a slower-growing pandemic with a smaller growth rate \(r\) of 0.01 and a low probability of detection \(c\) of 0.0001.
set.seed(42)
sim <- tibble(t = seq(1, 2000)) %>%
mutate(
n = exp(r * t),
lambda = c * n,
x = rpois(n(), lambda),
X = cumsum(x)
)
first_detection <- sim %>% filter(x > 0) %>% slice(1) %>% pull(t) %>% print
## [1] 573
Under these assumptions, the first detection occurs on day 573. What about more generally: what is the distribution of the time to first detection \(\psi\)? We are interested in the probability that at least one case has been observed by time \(t\), \(F_\psi(t) = \mathbb{P}(\psi \leq t) = \mathbb{P}(x_s > 0 \text{ for some } s \leq t)\). Recalling that \(x_t \sim \text{Poisson}(\lambda_t)\), this can be calculated via \[ \mathbb{P}(x_s > 0 \text{ for some } s \leq t) = 1 - \mathbb{P}(x_s = 0 \text{ for all } s \leq t) \\ = 1 - \prod_{s \leq t} \mathbb{P}(x_s = 0) \\ = 1 - \prod_{s \leq t} \frac{{\lambda_s}^0}{0!} e^{-\lambda_s} \\ = 1 - \prod_{s \leq t} e^{-\lambda_s} \\ = 1 - \exp\left( - \sum_{s \leq t} \lambda_s \right) \]
#' Probability that at least one case has been detected by day t, i.e. F_psi(t)
compute_psi <- function(t, t0, c, n0, r) {
day <- t0:t
lambda_t <- c * n0 * exp(r * day)
1 - exp(-sum(lambda_t))
}
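As a cross-check (an addition here), the inner sum is a geometric series, so \(F_\psi(t)\) also has a closed form that avoids the loop over days:
#' Closed form: sum_{s=1}^{t} c * n0 * exp(r * s) = c * n0 * exp(r) * (exp(r * t) - 1) / (exp(r) - 1)
compute_psi_closed <- function(t, c, n0, r) {
  1 - exp(-c * n0 * exp(r) * (exp(r * t) - 1) / (exp(r) - 1))
}
compute_psi_closed(1, c = c, n0 = 1, r = r) # should agree with the compute_psi() call below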
compute_psi(1, 1, c = c, n0 = 1, r = r)
## [1] 0.0001009999
So, for example, the probability that at least one case has been observed by day one, using this formula, is 0.000101. We can plot the probability that at least one case has been observed by each day as follows:
psi <- tibble(t = seq(1, 2000)) %>%
mutate(
cdf = map_dbl(t, ~ compute_psi(.x, 1, c = c, n0 = 1, r = r)),
pdf = c(0, diff(cdf))
)
psi_long <- psi %>%
pivot_longer(
cols = c("pdf", "cdf"),
names_to = "type",
values_to = "value"
)
psi_long %>%
mutate(type = toupper(type)) %>%
ggplot(aes(x = t, y = value)) +
geom_line(col = cbpalette[6]) +
facet_wrap(~type, scales = "free") +
labs(x = "Day", y = "P(At least one case detected)") +
theme_minimal()
So, by the looks of things, 573 is quite a typical draw from the distribution of the time to first detection.
Using the fact that \(\mathbb{E}(\psi) = \int_0^\infty (1 - F_\psi(t)) \,\text{d} t - \int_{-\infty}^0 F_\psi(t) \,\text{d} t\), where \(F_\psi(\cdot)\) is the cumulative distribution function of \(\psi\) (the second integral vanishes here since \(\psi \geq 0\)), and noting that \(\psi\) takes integer values, we can compute the expected time to first detection as \[ \mathbb{E}(\psi) = \int_0^\infty (1 - F_\psi(t)) \,\text{d} t \\ = \sum_{t = 0}^\infty (1 - F_\psi(t)) \\ = \sum_{t = 0}^\infty \left( 1 - \left( 1 - \exp\left( - \sum_{s \leq t} \lambda_s \right) \right) \right) \\ = \sum_{t = 0}^\infty \exp\left( - \sum_{s \leq t} \lambda_s \right) \]
Perhaps there is a smart way to compute this exactly, but it can be approximated by truncating to a finite sum of the first \(N\) terms \[ \sum_{t = 0}^{N} \exp\left( {- \sum_{s \leq t} \lambda_s} \right) \]
Given that we know the PDF of \(\psi\), it should be quite easy to choose an \(N\) such that the approximation is accurate. Coding this up, we have
#' Implicitly, N here is length(sim$lambda).
#' Looking at the CDF or PDF of psi, this should be sufficient.
#' The cumulative intensity grows exponentially, so the terms exp(-sum(lambda)) decay
#' extremely quickly; including higher-order terms would not change the computation much.
length(sim$lambda)
## [1] 2000
lambda_cum <- cumsum(sim$lambda)
sum(exp(-lambda_cum))
## [1] 406.872
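As a quick check that truncating at \(N = 2000\) is harmless (added here), the final included term is already numerically zero, and the omitted terms are smaller still:
# Last term of the truncated sum; terms beyond t = 2000 are smaller again,
# so extending the sum would not change the estimate of E(psi)
exp(-lambda_cum[length(lambda_cum)])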
How do the factors \(c\) and \(r\) change the expected time to detection? Rewriting the expression for \(\mathbb{E}(\psi)\) more explicitly (with \(n_0 = 1\)) \[ \mathbb{E}(\psi) = \sum_{t = 0}^\infty \exp\left( - \sum_{s \leq t} c \exp \left( rs \right) \right) \\ = \sum_{t = 0}^\infty \prod_{s \leq t} \exp\left( - c \exp \left( rs \right) \right) \\ \] (I’ve not said anything useful about \(c\) or \(r\) here yet! Perhaps you, reader, can find something interesting to analyze about this equation.)
Rather than analyzing the above equation, let’s just plot the different CDF and PDF curves, varying \(r\) and \(c\):
r_seq <- 10^{seq(-3, -1, by = 0.1)}
psi_r <- expand_grid(t = 1:1000, r = r_seq) %>%
group_by(r) %>%
mutate(
cdf = pmap_dbl(list(t, r), ~ compute_psi(.x, 1, c = c, n0 = 1, r = .y)),
pdf = c(0, diff(cdf))
)
psi_r_long <- psi_r %>%
pivot_longer(
cols = c("pdf", "cdf"),
names_to = "type",
values_to = "value"
)
psi_r_long %>%
mutate(type = toupper(type)) %>%
ggplot(aes(x = t, y = value, col = r, group = r)) +
geom_line() +
scale_color_continuous(trans = "log10", type = "viridis") +
facet_wrap(~type, scales = "free") +
labs(title = "Effect of changing the growth rate parameter on time to detection",
subtitle = paste0("Probability of detection is fixed as ", c),
x = "Day", y = "P(At least one case detected)", col = "Growth rate, r") +
theme_minimal()
c_seq <- 10^{seq(-7, -2, by = 0.25)}
psi_c <- expand_grid(t = 1:1000, c = c_seq) %>%
group_by(c) %>%
mutate(
cdf = pmap_dbl(list(t, c), ~ compute_psi(.x, 1, c = .y, n0 = 1, r = r)),
pdf = c(0, diff(cdf))
)
psi_c_long <- psi_c %>%
pivot_longer(
cols = c("pdf", "cdf"),
names_to = "type",
values_to = "value"
)
psi_c_long %>%
mutate(type = toupper(type)) %>%
ggplot(aes(x = t, y = value, col = c, group = c)) +
geom_line() +
scale_color_continuous(trans = "log10", type = "viridis") +
facet_wrap(~type, scales = "free") +
labs(title = "Effect of changing the probability of detection parameter",
subtitle = paste0("Growth rate is fixed as ", r),
x = "Day", y = "P(At least one case detected)", col = "Probability of detection, c") +
theme_minimal()
So, it looks to me like changing the probability of detection on the \(\log_{10}\) scale shifts the time to detection roughly linearly, whereas changes to the growth rate, again on the \(\log_{10}\) scale, produce much larger, non-linear changes in the time to detection.
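A rough heuristic for these patterns (an approximation added here, not an exact analysis of the equation above): the first detection tends to occur around the time \(t^*\) at which the cumulative intensity reaches one. For small \(r\), \[ \sum_{s \leq t^*} c e^{rs} \approx \frac{c}{r} e^{r t^*} \approx 1 \implies t^* \approx \frac{1}{r} \log\left(\frac{r}{c}\right), \] so reducing \(c\) tenfold adds roughly \(\log(10)/r\) days to the time to detection (a constant shift per decade), whereas changing \(r\) rescales \(t^*\) multiplicatively through the \(1/r\) factor, which is why the growth rate has the more dramatic effect.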
Let’s fit the same model as before, to see how quickly exponential growth can be detected in this instance. Again, we have to fit models starting from after the first detection has occurred, due to the limitations of glm. The number of cases gets very large past day 1500 (see the figure below), so we will limit interest to \(t < 1000\).
sim %>%
ggplot(aes(t, x)) +
geom_point(alpha = 0.5) +
geom_line(aes(y = lambda), color = cbpalette[1]) +
labs(x = "Day", y = "Detections") +
theme_minimal()
Let’s filter down to the early stages of the outbreak and plot the detections and the confidence intervals for \(\beta\).
fits <- tibble(today = seq(first_detection + 1, 1000)) %>%
mutate(
fit = map(today, ~glm(x ~ t, family = poisson, data = sim[seq(.x),])),
fit = map(fit, broom::tidy, conf.int = TRUE)
) %>%
unnest(fit)
p1 <- sim %>%
filter(t < 1000) %>%
ggplot(aes(t, x)) +
geom_point(alpha = 0.5) +
geom_line(aes(y = lambda), color = cbpalette[1]) +
labs(x = "Day", y = "Detections") +
theme_minimal()
p2 <- fits %>%
mutate(above_zero = as.logical((conf.low > 0) & (conf.high > 0))) %>%
filter(term == 't', today < 1e3) %>%
ggplot(aes(today, y = estimate, ymax = conf.high, ymin = conf.low, col = above_zero)) +
geom_hline(yintercept = c(0, r), color = "black", linetype = "dashed") +
geom_pointrange(alpha = 0.5) +
scale_colour_manual(values = cbpalette[c(2, 3)]) +
expand_limits(x = 1) +
labs(x = "Day", y = "Estimated slope", col = "CI above zero?") +
theme_minimal()
p1 / p2
For this seed, after the first detection the confidence interval for \(\beta\) is always above zero. In other instances, the behaviour has been more interesting (e.g. after the initial detection \(\beta > 0\), but \(\beta = 0\) becomes plausible again after a while with no further detections).
As a side note, perhaps it’s possible to write a hot-start version of the Poisson GLM fitting which is initialised at the previous fit’s coefficients and would make things faster, along the lines of the following (though this would still be slow in R due to the looping, I think):
hotstart_poisson_glm <- function(times) {
  out <- list()
  start_time <- times[1]
  remaining_times <- times[-1]
  # Fit the first model from scratch
  fit <- glm(x ~ 1 + t, family = poisson, data = sim[1:start_time, ])
  out[[start_time]] <- fit
  # Refit on each successive day, warm-starting IRLS at the previous coefficients
  for (time in remaining_times) {
    fit <- glm(x ~ 1 + t, family = poisson, data = sim[1:time, ], start = fit$coefficients)
    out[[time]] <- fit
  }
  return(out)
}
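A hypothetical call (not run here) would mirror the repeated fits above:
# Warm-started refits on each day from just after the first detection to day 1000;
# note hotstart_poisson_glm() relies on `sim` being available in the calling environment
fits_hot <- hotstart_poisson_glm(seq(first_detection + 1, 1000))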
Let’s have a closer look at the period around the time of the first detection:
p2 & xlim(c(first_detection - 10, first_detection + 100))
The confidence interval for the \(\beta\) parameter is above zero for many days following the first detection. Having detected a single case after hundreds of days without any, the model judges it implausible that a constant rate would have produced zeros for that long, so it infers a positive slope. This occurs because the Poisson GLM we have specified is not very flexible: the only relationship we have allowed so far is a linear one, \(\log(\lambda_t) = \beta_0 + \beta t\).
As a demonstration, suppose we consider another model specification \(\log(\lambda_t) = \beta_0 + \beta \log(t)\).
fits_log <- tibble(today = seq(first_detection + 1, 1000)) %>%
mutate(
fit = map(today, ~glm(x ~ log(t), family = poisson, data = sim[seq(.x),])),
fit = map(fit, broom::tidy, conf.int = TRUE)
) %>%
unnest(fit)
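One way to compare the two specifications (a sketch added here, not part of the original analysis) is to fit both to the data up to day 1000 and compare their AIC:
# Fit the linear-time and log-time Poisson regressions to the first 1000 days
fit_linear <- glm(x ~ t, family = poisson, data = sim[seq(1000), ])
fit_logt <- glm(x ~ log(t), family = poisson, data = sim[seq(1000), ])
# Lower AIC indicates the better-supported specification
AIC(fit_linear, fit_logt)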
We will now move on to considering a (very toy) case study. Let \(T_d\) be the doubling time of the virus, which we will model after SARS-CoV-2 by taking \(T_d = 3\) days; this value has been noted for European countries here, though this article reports faster doubling times for Chinese provinces, including 2.5 days for Hubei province. To recover the exponential growth rate from the doubling time, calculate \[ e^{rT_d} = 2 \implies r = \log 2 / T_d. \]
Note that this RAND article gives a 10x increase over roughly one week, which corresponds to a substantially faster doubling time (as shown by this figure in the report, where the number of infected travelers was 0.1/day on Jan 22 and 1/day on Jan 29).
wuhan_tianhe_yearly <- 20000000
wuhan_tianhe_daily <- wuhan_tianhe_yearly / 365
wuhan_tianhe_daily
## [1] 54794.52
wuhan_population <- 11000000
prob_airport <- wuhan_tianhe_daily / wuhan_population
The Wuhan Tianhe International Airport serves around 20000000 people per year, corresponding to 54795 people per day. The population of Wuhan is 11000000. (We could halve the total served by the airport to count only outgoing flights, but the figures below use the full total.) The percentage of the population in the catchment area (loosely defined) who use the airport per day, \(p_a\), is 0.498132%. This corresponds to an average of 1.8181818 flights per person per year, which is potentially a bit high but seems relatively plausible.
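For reference, the figures quoted above are simple ratios of the quantities already defined:
# Average flights per person per year, and the daily per-person probability of
# using the airport, expressed as a percentage
wuhan_tianhe_yearly / wuhan_population
100 * prob_airport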
prob_detect <- 0.1
Let’s imagine that we had an environmental monitoring site set up at Wuhan Tianhe. Suppose that each person passing through the airport with SARS-CoV-2 has a probability \(p_d = 0.1\) of being detected (note that this figure is not based on any particular insight). The probability of detecting a case at the airport is the product of two factors, the probability that someone infected uses the airport and the probability that they are then detected, so this case is analogous to \[ c = p_a \times p_d \]
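Putting the two factors together (using the quantities defined above), the implied per-infection, per-day detection rate at the sentinel site is:
# c = p_a * p_d
prob_airport * prob_detect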
doubling_time <- 3
r <- log(2) / doubling_time
prob_detect <- 0.1
set.seed(42)
sim <- tibble(t = seq(-30, 50)) %>%
mutate(
lambda_total = exp(r * t),
lambda_travel = prob_airport * lambda_total,
lambda_detect = prob_detect * lambda_travel,
n_total = rpois(n(), lambda_total),
n_travel = rbinom(n(), n_total, prob = prob_airport),
n_detect = rbinom(n(), n_travel, prob = prob_detect)
)
sim_long <- sim %>%
pivot_longer(c(starts_with('n_'), starts_with('lambda_')),
names_to = c('var', 'type'),
names_sep = '_'
) %>%
pivot_wider(names_from = var) %>%
print
## # A tibble: 243 × 4
## t type n lambda
## <int> <chr> <dbl> <dbl>
## 1 -30 total 0 0.000977
## 2 -30 travel 0 0.00000486
## 3 -30 detect 0 0.000000486
## 4 -29 total 0 0.00123
## 5 -29 travel 0 0.00000613
## 6 -29 detect 0 0.000000613
## 7 -28 total 0 0.00155
## 8 -28 travel 0 0.00000772
## 9 -28 detect 0 0.000000772
## 10 -27 total 0 0.00195
## # … with 233 more rows
first_case <- function(x) {
sim_long %>%
filter(
type == x,
n > 0
) %>%
summarise(min_t = min(t)) %>%
pull(min_t)
}
t_first_total <- first_case("total")
t_first_travel <- first_case("travel")
t_first_detect <- first_case("detect")
p1 <- sim_long %>%
mutate(type = dplyr::recode_factor(type, "total" = "Total", "travel" = "Travel", "detect" = "Detect")) %>%
ggplot(aes(t, n, color = type)) +
geom_point() +
geom_line(aes(y = lambda)) +
geom_vline(aes(xintercept = t_first_total), linetype = "dashed") +
geom_vline(aes(xintercept = t_first_travel), linetype = "dashed") +
geom_vline(aes(xintercept = t_first_detect), linetype = "dashed") +
facet_wrap(~type, scales = "free", ncol = 1) +
scale_colour_manual(values = cbpalette) +
labs(x = "Day", y = "Count", col = "Stage") +
theme_minimal() +
theme(
legend.position = "bottom"
)
p1
The lag in days \(\ell\) between cases in the population and cases starting to pass through the airport satisfies \[ p_a e^{r(t + \ell)} = e^{rt} \implies \ell = \log(1 / p_a) / r = T_d \log(1 / p_a) / \log(2) = T_d \log_2(1 / p_a), \] which evaluates to 22.9 days. Similarly, the lag in days between cases passing through the airport and cases being detected at the airport is \(\log(1 / p_d) / r\), another 10 days.
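These lags can be computed directly from the quantities defined above:
# Lag from community cases to airport cases, and from airport cases to detections
lag_airport <- log(1 / prob_airport) / r
lag_detect <- log(1 / prob_detect) / r
c(lag_airport, lag_detect)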
In this example, the first case occurs on day -14, the first case at the airport on day 19 and the first case to be detected on day 26.