knitr::opts_chunk$set(
  echo = TRUE,
  cache = TRUE,
  autodep = TRUE,
  cache.comments = FALSE
)

options(scipen = 999)

cbpalette <- multi.utils::cbpalette()
library(tidyverse)
library(cowplot)
library(patchwork)
theme_set(theme_minimal())

1 Background

Any biological threat will necessarily undergo exponential growth. Detecting this growth pattern is therefore a promising computational strategy for a threat-agnostic early warning system, as described by The Nucleic Acid Observatory Consortium (2021).

To develop and evaluate this idea, this quarter we aim to:

OKR 1.4: Develop an initial (computational) pipeline for exponential growth detection and execute it on simulated and spike-in metagenomic data

Our computational pipeline can be approximately divided into a bioinformatics stage and a statistics stage. The bioinformatics stage was the subject of Jeff’s memo last week, and currently outputs a collection of \(k\)-mer equivalence class time series1.

Given these time series, the goal of the statistical stage is to infer which are in an exponential growth regime, using some hypothetical function egd() – “exponential growth detection”. This function should be capable of identifying threats without an insurmountable2 number of false positives. It also needs to be computationally efficient enough to scale to a large number of time series without being prohibitively slow.
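
To make the target concrete, here is a hypothetical sketch of the interface egd() might have, anticipating the Poisson regression internals motivated in the next section (the function itself does not yet exist):

# Hypothetical egd() interface: one time series in, evidence of growth out
egd <- function(count, t) {
  fit <- glm(count ~ 1 + t, family = "poisson")
  coefs <- summary(fit)$coefficients
  c(r_hat = coefs["t", "Estimate"], p_value = coefs["t", "Pr(>|z|)"])
}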

2 Why Poisson regression could form the basis for egd()

set.seed(1)

r <- 0.2

df <- data.frame(t = 1:14) %>%
  mutate(
    lambda = 0.1 * exp(r * t),
    count = rpois(n(), lambda = lambda) # one Poisson draw per day
  )

ggplot(df, aes(x = t, y = count)) +
  geom_point(col = cbpalette[3], size = 2.5) +
  geom_line(aes(y = lambda), col = cbpalette[1]) +
  theme_minimal() +
  labs(x = "Day", y = "Number of copies observed")
Figure 2.1: Example of an exponential curve in blue with randomly generated count data in orange.

Our metagenomic data has two important properties that we would like egd() to take into account:

  1. The observation process is random
  2. The counts we observe take discrete, integer values (like “0” or “239”)

Both of these features make it harder to detect exponential growth, as illustrated by Figure 2.1: the observed orange counts are noisy, sometimes increasing and sometimes decreasing, unlike the monotonically increasing blue line; furthermore the blue line can take any value, whereas the orange counts are only ever 0, 1, or 2 (in this example).

One of the simplest probability distributions suitable for count data is the Poisson distribution. It describes the probability of observing an outcome a given number of times when there are a large number of independent opportunities for the outcome to occur, but the chance of the outcome occurring at each opportunity is small.

The Poisson distribution is well-suited for counts generated by sequencing. Think of the sequencing process in the abstract as choosing a very large number (e.g. 1M) of DNA segments of a given length (e.g. 100bp) in a sample, effectively at random, and writing the resulting sequences into a file. The number of times a particular 40-mer appears is then approximately Poisson distributed.
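
As a quick check of this approximation (our simulation, with an assumed relative abundance), the count of a given 40-mer across runs has mean approximately equal to variance, as a Poisson distribution requires:

set.seed(1)
n_kmers <- 1e6 * (100 - 40 + 1) # total 40-mers across 1M reads of 100bp = 61M
p <- 5e-8                       # assumed relative abundance of our 40-mer
counts <- rbinom(1000, size = n_kmers, prob = p) # 1000 simulated sequencing runs
c(mean = mean(counts), var = var(counts), lambda = n_kmers * p)
# mean and variance are both close to lambda = 3.05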

plot1 <- data.frame(x = 0:3) %>%
  mutate(y = dpois(x, lambda = 0.1)) %>%
  ggplot(aes(x = x, y = y)) +
    geom_col(fill = cbpalette[7], alpha = 0.7) +
    theme_minimal() +
    lims(y = c(0, 1)) +
    labs(x = "Count", y = "Probability of observing that count", subtitle = "lambda = 0.1")

plot2 <- data.frame(x = 0:7) %>%
  mutate(y = dpois(x, lambda = 1)) %>%
  ggplot(aes(x = x, y = y)) +
    geom_col(fill = cbpalette[7], alpha = 0.7) +
    theme_minimal() +
    lims(y = c(0, 0.5)) +
    labs(x = "Count", y = "Probability of observing that count", subtitle = "lambda = 1")

plot3 <- data.frame(x = 0:25) %>%
  mutate(y = dpois(x, lambda = 10)) %>%
  ggplot(aes(x = x, y = y)) +
    geom_col(fill = cbpalette[7], alpha = 0.7) +
    theme_minimal() +
    lims(y = c(0, 0.13)) +
    labs(x = "Count", y = "Probability of observing that count", subtitle = "lambda = 10")

cowplot::plot_grid(plot1, plot2, plot3, ncol = 3)
Figure 2.2: Examples of the Poisson distribution with rate parameters 0.1, 1 and 10.

The Poisson distribution is characterized by a single parameter, \(\lambda\), which equals the expected value (or mean) of the count \(C\); that is, \(\mathbb{E}[C] = \lambda\). Figure 2.2 shows examples of what the Poisson distribution looks like for \(\lambda = 0.1, 1, 10\): as \(\lambda\) gets larger it looks more and more like a Gaussian distribution, but for small \(\lambda\) the difference is clear. In our sequencing example, the expected number of times we will observe a particular 40-mer in the data equals its proportion or relative abundance in the original sample multiplied by the total number of 40-mers in the data (here, equal to 1M * (100 - 40 + 1) = 61M). A \(k\)-mer that is twice as abundant as another in the sample has a value of \(\lambda\) (and expected sequencing count) that is twice as high.

Let \(C_t\) be the count of the \(k\)-mer in the sample from day \(t\). We can assess whether a \(k\)-mer is exponentially increasing as follows. First, we stipulate that the \(\lambda\) value for the \(k\)-mer (its expected count) has an exponential trend, either increasing or decreasing: \[ \lambda_t = \exp(a + r t). \] Here, \(r\) is the exponential growth rate of the \(k\)-mer, with \(r > 0\) corresponding to growth and \(r < 0\) corresponding to decay.

We then infer the parameters \(a\) and \(r\) from data using a method called Poisson regression. You can think of Poisson regression as being an extension of linear regression. In fact, the equation \(a + rt\) specifies a straight line, where \(a\) is the intercept and \(r\) is the slope.

fit <- glm(count ~ 1 + t, family = "poisson", data = df)
p_val <- summary(fit)$coefficients[2, 4]

df$lambda_mean <- fit$fitted.values

ci <- predict(fit, newdata = data.frame(t = 1:14), se.fit = TRUE, type = "link")
upr_latent <- ci$fit + 2 * ci$se.fit
lwr_latent <- ci$fit - 2 * ci$se.fit
df$lambda_upr <- fit$family$linkinv(upr_latent) # linkinv() is just exp() here
df$lambda_lwr <- fit$family$linkinv(lwr_latent)

ggplot(df, aes(x = t, y = count)) +
  geom_point(col = cbpalette[3], size = 2.5) +
  geom_line(aes(y = lambda), col = cbpalette[1]) +
  geom_line(aes(y = lambda_mean), col = cbpalette[2]) +
  geom_ribbon(aes(ymin = lambda_lwr, ymax = lambda_upr), fill = cbpalette[2], alpha = 0.1) +
  theme_minimal() +
  labs(x = "Day", y = "Number of copies observed")
Figure 2.3: The green line shows the fitted model.

Poisson regression is a well-studied approach, with many tools available, including algorithms that we can use to obtain estimates for \(a\) and \(r\). For the data above, the estimates we obtain give the curve \(\exp(-1.227 + 0.095t)\), shown in Figure 2.3. To determine whether the time series is exponentially growing, we look at inference3 for the slope \(r\): the larger \(r\) appears to be, the more confident we can be that we are looking at exponential growth. In this particular example, the extent of the exponential growth is under-estimated4: the estimate 0.095 is well below the true value \(r = 0.2\) used to simulate the data.
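
Concretely, this inference can be read directly off the fitted model object. A minimal sketch using the fit from above:

# Point estimate and Wald 95% confidence interval for the slope r (footnote 3)
r_hat <- coef(fit)["t"]
r_se <- summary(fit)$coefficients["t", "Std. Error"]
c(estimate = unname(r_hat),
  lower = unname(r_hat - 1.96 * r_se),
  upper = unname(r_hat + 1.96 * r_se))
# Here the interval spans zero, matching the under-estimation noted above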

In preliminary testing, we have found that Poisson regression works well in simple situations.

3 Challenges ahead

We have also identified a number of issues and open questions to be addressed as we continue to develop the approach and begin testing on more realistic, challenging situations. Some of these issues we know how to address; for others we aren’t sure, but we have a good idea of where to start.

3.1 Accounting for sequencing depth and other multipliers

The counts we observe are proportional to sequencing depth: if we pay for 10X more sequencing reads, we will (on average) get 10X higher counts for each \(k\)-mer. Further, due to the idiosyncrasies of sample and library preparation, we will get more sequencing reads from some wastewater samples than others. This variation will tend to mask the true changes in the wastewater composition if not properly accounted for.

We can account for differences in sequencing depth by adding an exposure term to our Poisson regression model. Let \(F_t\) be a measure of sequencing effort at time \(t\), such as the total number of reads or the total number of feature counts.5 We then use Poisson regression to fit the model \[ \lambda_t = \exp(a + r t) \cdot F_t. \]
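
In R, the exposure enters the model as an offset on the log scale: \(\log \lambda_t = a + rt + \log F_t\). A minimal sketch, assuming a hypothetical total_reads column in the role of \(F_t\):

# total_reads is a hypothetical stand-in for the per-day sequencing effort F_t
set.seed(1)
df$total_reads <- 1e6 + round(rnorm(14, sd = 1e5))
# offset(log(total_reads)) enters the linear predictor with coefficient fixed at 1
fit_offset <- glm(count ~ 1 + t + offset(log(total_reads)),
                  family = "poisson", data = df)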

Under this model, the exponential growth rate \(r\) represents the growth rate of the proportion of the \(k\)-mer among all \(k\)-mers. We can change the exposure term so that \(r\) represents other quantities. For example, we could set the exposure equal to the number of reads from crAssphage or PMMoV, the most abundant DNA and RNA viruses in human feces. Then, \(r\) represents the growth rate of the \(k\)-mer relative to these fecal markers, which can account for fluctuations in the total amount of fecal shedding each day.

3.2 Additional sources of noise

We expect that real counts derived from an exponentially growing virus have greater variance than implied by the Poisson distribution, due to additional randomness from viral shedding, wastewater transport, and sample collection and processing. The problem of the data having greater variance than the probability model implies is known as over-dispersion.

Over-dispersion tends to make the results of Poisson regression overconfident: it may tell us that a sequence is very likely to be exponentially increasing when really it is just randomly jumping around more than expected. This problem can be overcome by using generalizations of Poisson regression such as gamma-Poisson or negative binomial regression, which allow for increased variance in the observed counts but are more complex and thus slower to fit.
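
As a minimal sketch (assuming MASS as one reasonable tool; we have not committed to an implementation), negative binomial regression is nearly a drop-in replacement:

library(MASS)
# glm.nb() estimates a dispersion parameter theta alongside a and r, so the
# variance is mu + mu^2 / theta rather than the Poisson's variance = mu
fit_nb <- glm.nb(count ~ 1 + t, data = df)
summary(fit_nb)$coefficients["t", ] # inference for r, typically with wider se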

3.3 Only allows pure exponential curve shapes

set.seed(1)

df <- tibble(t = 1:14) %>%
  mutate(
    lambda = 10 * (1.1 + cos(2 * pi / 14 * t)), # sinusoid with a 14-day period
    count = rpois(n(), lambda = lambda)
  )

fit <- glm(count ~ 1 + t, family = "poisson", data = df)

df$lambda_mean <- fit$fitted.values

ci <- predict(fit, newdata = data.frame(t = 1:14), se.fit = TRUE, type = "link")
upr_latent <- ci$fit + 2 * ci$se.fit
lwr_latent <- ci$fit - 2 * ci$se.fit
df$lambda_upr <- fit$family$linkinv(upr_latent) # linkinv() is just exp() here
df$lambda_lwr <- fit$family$linkinv(lwr_latent)

ggplot(df, aes(x = t, y = count)) +
  geom_point(col = cbpalette[3], size = 2.5) +
  geom_line(aes(y = lambda), col = cbpalette[1]) +
  geom_line(aes(y = lambda_mean), col = cbpalette[2]) +
  geom_ribbon(aes(ymin = lambda_lwr, ymax = lambda_upr), fill = cbpalette[2], alpha = 0.1) +
  theme_minimal() +
  labs(x = "Day", y = "Number of copies observed")
Figure 3.1: The green fitted line is unable to recover the underlying blue sinusoidal structure, which generates the observed orange counts.

The model is not very flexible: it can only describe pure exponential growth and decay. As a result, we may impose an exponential growth pattern onto other types of pattern (e.g. linear, sinusoidal), as there are no other options (Figure 3.1). A more flexible model might allow for these other types of pattern, either explicitly (parametrically) or implicitly (non-parametrically). Hard-coding in other common patterns may be enough, but ideally a non-parametric6 model like a Gaussian process would be used. These more complex models will again be slower to fit.
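
As one concrete possibility (a sketch of ours, using a spline-based smoother via mgcv rather than a full Gaussian process, but illustrating the same kind of flexibility):

library(mgcv)
# s(t) lets log(lambda) follow a smooth curve instead of a straight line
fit_gam <- gam(count ~ s(t), family = poisson, data = df)
plot(fit_gam) # the fitted smooth can track the sinusoidal structure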

3.4 Decision procedures

We are testing many time series for exponential growth simultaneously. Because of the multiple comparisons problem, naively using \(p\)-values will result in many false positives. Although it’s possible to control for error rates by modifying the significance levels required, there are reasons to believe that this issue isn’t a major worry.
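
For illustration (a sketch of standard tooling, not a committed procedure), a correction such as Benjamini-Hochberg can be applied across the per-series \(p\)-values:

# Placeholder p-values, one per time series; under a pure null these are uniform
p_values <- runif(1e4)
p_adj <- p.adjust(p_values, method = "BH")
sum(p_adj < 0.05) # number of series flagged after false discovery rate control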

The problem of early detection could be thought of as trying to find a needle in a haystack. In this framing, we might not be too concerned if egd() catches some hay, as long as it doesn’t discard the needle. If egd() substantially filters down the number of \(k\)-mers to something manageable, we could then run more sophisticated analyses on these interesting \(k\)-mers. This might include more complex statistical time-series models, reintegrating the bioinformatic aspects, or even generating new (possibly targeted) data to analyse.

3.5 Model evaluation

When developing a model, it’s important to decide how to evaluate performance, as this determines what the model is good at. For example, we could choose to penalise false negatives (classifying as not exponential when the truth is exponential) more than we penalise false positives (classifying as exponential when the truth is not exponential): that is, \[ \text{cost}(\text{FN}) > \text{cost}(\text{FP}), \] where the particular costs involved could be tuned as we’d like. This would result in a model which errs on the side of flagging things as being exponentially increasing.
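
A toy sketch of scoring such an asymmetric loss (the 10:1 cost ratio is an arbitrary placeholder, purely for illustration):

evaluation_cost <- function(truth, flagged, cost_fn = 10, cost_fp = 1) {
  fn <- sum(truth & !flagged) # exponential series that were not flagged
  fp <- sum(!truth & flagged) # non-exponential series that were flagged
  cost_fn * fn + cost_fp * fp
}
evaluation_cost(truth = c(TRUE, TRUE, FALSE), flagged = c(TRUE, FALSE, TRUE))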

Another decision we have to make is whether we care about detecting exponential regimes (just saying whether the time series is exponentially increasing) or about time series forecasting (using the counts up to time \(t\) to predict future counts). In the latter case, accurately estimating the growth rate \(r\) is essential, but in the former case we only care about the growth rate insofar as it is above zero. In a similar vein, we might also want to decide whether we care more about detecting high growth than low growth.

3.6 Behaviour for new observations

Note: this section was omitted from the memo as originally presented.

df_new <- data.frame(
  t = 1:14,
  count = c(rep(0, 13), 1)
)

fit_new <- glm(count ~ 1 + t, family = "poisson", data = df_new)

df_new$lambda_mean <- fit_new$fitted.values
df_new$eta_mean <- fit_new$linear.predictors

# Note: log(count) is -Inf for the 13 zero counts
ggplot(df_new, aes(x = t, y = log(count))) +
  geom_point(col = cbpalette[3], size = 2.5) +
  geom_line(aes(y = eta_mean), col = cbpalette[2]) +
  theme_minimal() +
  labs(x = "Day", y = "log(Number of copies observed)")

The model behaves strangely in some cases. If a “1” is observed after many days of “0”, it’s possible to fit the data arbitrarily well by making the intercept \(a\) very negative and the slope \(r\) large and positive. This issue can be remedied by placing priors on the parameters to provide some form of regularisation.
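
One sketch of such regularisation (an option we have not committed to; brms or rstanarm would also work) uses arm::bayesglm(), which places weakly-informative priors on the coefficients by default:

library(arm)
# Default Cauchy priors shrink the coefficients, preventing the degenerate fit
fit_reg <- bayesglm(count ~ 1 + t, family = "poisson", data = df_new)
coef(fit_reg) # finite, regularised estimates of the intercept and slope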

4 Computational suitability

The following table shows how long our initial script for running Poisson regression takes when there are 10, 100, 1000 and 10000 rows:

so_formatter <- function(x) {
  dplyr::case_when(
      x < 1e3 ~ as.character(x),
      x < 1e6 ~ paste0(as.character(x / 1e3), "K"),
      x < 1e9 ~ paste0(as.character(x / 1e6), "M"),
      x < 1e12 ~ paste0(as.character(x / 1e9), "G"),
      TRUE ~ "To be implemented..."
  )
}

pivot <- data.frame(
  type = "pivot",
  rows = c(10, 100, 1000, 10000),
  time = c(2.61E-3, 2.17E-3, 2.43E-3, 3.88E-3),
  mem = c(32.09E3, 92.3E3, 693.47E3, 6.55E6)
)

glm <- data.frame(
  type = "glm",
  rows = c(10, 100, 1000, 10000),
  time = c(47.4E-3, 360.6E-3, 4.3, 49.6),
  mem = c(418.8E3, 6.3E6, 296.1E6 , 25.7E9)
) 

tidy <- data.frame(
  type = "tidy",
  rows = c(10, 100, 1000, 10000),
  time = c(1.42E-3, 1.13E-3, 1.68E-3, 15.82E-3 ),
  mem = c(18E3, 18.9E3, 25.9E3 ,96.2E3)
) 

df <- bind_rows(pivot, glm, tidy) %>%
  group_by(rows) %>%
  summarise(
    time = sum(time),
    mem = sum(mem)
  ) %>%
  mutate(
    mem = so_formatter(signif(mem, digits = 3))
  )

df %>%
  gt::gt() %>%
  gt::cols_align(
    align = "left",
    columns = rows
  ) %>%
  gt::cols_align(
    align = "right",
    columns = c("time", "mem") 
  ) %>%
  gt::cols_label(
    rows = "Number of rows",
    time = "Time (s)",
    mem = "Memory (B)"
  ) %>%
  gt::fmt_number(
    columns = time,
    decimals = 2
  )
Number of rows    Time (s)    Memory (B)
10                    0.05          469K
100                   0.36         6.41M
1000                  4.30          297M
10000                49.62         25.7G

time_10K_s <- signif(max(df$time), 1)
time_1M_s <- time_10K_s * 100
time_1M_h <- signif(time_1M_s / 60^2, 1)

The most computationally expensive part of the code is running the R function stats::glm(), which fits the Poisson regression model. The time taken appears to be approximately linear in the number of rows (which makes sense, as each time series is fit independently, so the operation is embarrassingly parallel). Assuming linearity, 50 seconds for 10K rows corresponds to roughly 1 hour for 1M rows. This computation time could easily be distributed across multiple machines.
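
As a sketch of what that parallelism might look like on a single machine (the list-of-data-frames input is a hypothetical stand-in for the real pipeline output):

library(parallel)
# Toy stand-in: one small data frame per k-mer time series
kmer_series <- replicate(100, data.frame(t = 1:14, count = rpois(14, 1)),
                         simplify = FALSE)
fit_one <- function(d) glm(count ~ 1 + t, family = "poisson", data = d)
# Fit independently across cores (mc.cores > 1 requires a Unix-alike)
fits <- mclapply(kmer_series, fit_one, mc.cores = 4)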

Although this may seem quite slow, we expect Poisson regression to be computationally cheaper than other possible statistical approaches. Additionally, little optimisation effort has gone into this code, so it’s likely that significant improvements can be made.

5 Conclusion

Poisson regression is a reasonable first exponential growth detection approach, and there is a lot of room for iteration and improvement.

Bibliography

The Nucleic Acid Observatory Consortium. 2021. “A Global Nucleic Acid Observatory for Biodefense and Planetary Health.” arXiv Preprint arXiv:2108.02678.

  1. This earlier stage of the pipeline may change, but the statistical methods we are developing will likely be relevant irrespective of the precise bioinformatic details.↩︎

  2. We don’t know exactly what would constitute “insurmountable”. Ultimately what matters is the performance of the overall pipeline, not the egd() component in isolation.↩︎

  3. By the term “inference”, we mean a point estimate \(\hat r\), a (95%) confidence interval \(\hat r \pm 1.96 \times \text{se}(\hat r)\), or a posterior distribution \(p(r \, | \, y)\).↩︎

  4. The \(p\)-value for \(r \neq 0\) is 0.27: too high to be statistically significant at the conventional 5% level.↩︎

  5. It often doesn’t matter which, since the two will typically be proportional, and the factor difference will only affect the intercept \(a\).↩︎

  6. This in some sense means a model with infinitely many parameters.↩︎