knitr::opts_chunk$set(
  echo = TRUE,
  cache = TRUE,
  autodep = TRUE,
  cache.comments = FALSE
)

options(scipen = 999)

cbpalette <- multi.utils::cbpalette()

library(tidyverse)

First simulated example

We start by importing the data generated in the introduction to the simple version of Simulating metagenomic time series data.

df <- readRDS("depends/sample0.rds") %>%
  as_tibble()

A full description of the data is provided in the linked notebook, but to give a quick idea: the plot below shows each draw from the two regimes, baseline and exponential, as faint lines, together with each regime's 90% interval (shaded region) and median (bold line).

sample_summary <- df %>%
  group_by(day, regime) %>%
  summarise(
    count_upper = quantile(count, 0.95),
    count_median = median(count),
    count_lower = quantile(count, 0.05),
    .groups = "drop"
  )
ggplot(sample_summary, aes(x = day, ymin = count_lower, y = count_median, ymax = count_upper, group = regime)) +
    geom_ribbon(alpha = 0.1, aes(fill = regime)) +
    geom_line(aes(col = regime), size = 1.5) +
    geom_line(data = df, aes(x = day, y = count, col = regime, group = id),
               alpha = 0.025, inherit.aes = FALSE) + 
    theme_minimal() +
    scale_color_manual(values = cbpalette) +
    scale_fill_manual(values = cbpalette) +
    guides(fill = "none") +
    labs(x = "Day", y = "Number of reads in sample", col = "Regime")

Performing inference

We fit a Poisson regression model (see Explaining Poisson regression or Estimating exponential growth via sporadic detection) to each \(k\)-mer in the data. It is important to note that the generative process used to create the simulated data may be more complex than the Poisson regression inferential model. We take this approach because we expect any functional NAO system to need to handle a large number of \(k\)-mers (or \(k\)-mer equivalence classes, etc.), and therefore we require the inferential approach to be highly scalable. Poisson regression is a comparatively simple inferential approach, which we hope can be shown via benchmarking (see Benchmarking Poisson regression for exponential growth detection) to be scalable to the number of \(k\)-mers we expect, which might be on the order of \(10^9\).1 If Poisson regression is computationally feasible but has particular unsatisfactory statistical behaviour, then we could consider a more sophisticated inferential model to overcome those specific problems.

#' Fit to each unique id
#' Note that (confusingly) kmer is not unique, and represents the kmer within a given organism
fits <- tibble(id = unique(df$id)) %>%
  mutate(
    fit = map(id, ~glm(count ~ 1 + day, family = "poisson", data = filter(df, id == .))),
    fit = map(fit, broom::tidy, conf.int = TRUE)
  ) %>%
  unnest(fit)

#' Add truth column; use distinct() so the join does not duplicate fits rows
#' (df has one row per id and day)
fits <- fits %>%
  left_join(
    distinct(df, id, regime),
    by = "id"
  )

Now we can look at the inference for the slope parameter \(\beta\) in each regression, first the point estimate with confidence interval, then the \(p\)-value for \(\beta \neq 0\). In the following two plots the truth – either exponential or baseline regime – is shown by the point colour.

fits_day <- filter(fits, term == "day")

fits_day %>%
  ggplot(aes(x = id, y = estimate, ymin = conf.low, ymax = conf.high, col = regime)) +
    geom_pointrange(alpha = 0.05, size = 0.5) +
    theme_minimal() +
    scale_color_manual(values = cbpalette) +
    labs(x = "k-mer ID", y = "Estimated slope", col = "True regime") +
    theme(
      legend.position = "bottom"
    )

fits_day %>%
  ggplot(aes(x = id, y = p.value, col = regime)) +
  geom_point() +
  theme_minimal() +
  scale_color_manual(values = cbpalette) +
  labs(x = "k-mer ID", y = "p-value for non-zero slope", col = "True regime") +
  theme(
    legend.position = "bottom"
  )

saveRDS(fits, "fits-sample0.rds")

The fits_day object has columns including the \(k\)-mer ID, estimated slope and standard error, statistic (given by statistic = estimate / std.error), p-value (given by p.value = 2 * (1 - pnorm(abs(statistic)))) and the lower and upper bounds of the 95% confidence interval. Note that with conf.int = TRUE, broom::tidy() obtains these bounds from confint(), which for a glm profiles the likelihood; they are typically close to, but not identical to, the Wald bounds estimate - qnorm(0.975) * std.error and estimate + qnorm(0.975) * std.error.
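As a quick check of these relationships, we can reconstruct the Wald quantities from the estimate and standard error and compare them against the columns reported by broom::tidy(). A small sketch using the fits_day object defined above:

#' Reconstruct the Wald statistic, p-value and interval bounds, and compare
#' them to the reported columns (the profile bounds should be close but not
#' identical to the Wald bounds)
fits_day %>%
  mutate(
    statistic_check = estimate / std.error,
    p.value_check = 2 * (1 - pnorm(abs(statistic_check))),
    wald_lower = estimate - qnorm(0.975) * std.error,
    wald_upper = estimate + qnorm(0.975) * std.error
  ) %>%
  select(
    id, statistic, statistic_check, p.value, p.value_check,
    conf.low, wald_lower, conf.high, wald_upper
  ) %>%
  head()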

Performance assessment

Suppose we classify a \(k\)-mer as exponentially growing when the 95% confidence interval for \(\beta\) lies entirely above zero. We can see which \(k\)-mers we misclassify by comparing this estimated regime to the true regime. In the following two plots, correctly classified \(k\)-mers are shown in grey and incorrectly classified \(k\)-mers in red.

fits_day <- fits_day %>%
  mutate(
    est_regime = case_when(
      conf.low > 0 ~ "Exponential",
      TRUE ~ "Baseline"
    ),
    correct = regime == est_regime
  )

fits_day %>%
  ggplot(aes(x = id, y = estimate, ymin = conf.low, ymax = conf.high, col = correct)) +
    geom_pointrange(alpha = 0.05, size = 0.5) +
    theme_minimal() +
    scale_color_manual(values = c("#D22B2B", "grey")) +
    labs(x = "k-mer ID", y = "Estimated slope", col = "Correctly classified?") +
    theme(
      legend.position = "bottom"
    )

fits_day %>%
  ggplot(aes(x = id, y = p.value, col = correct)) +
  geom_point() +
  theme_minimal() +
  scale_color_manual(values = c("#D22B2B", "grey")) +
  labs(x = "k-mer ID", y = "p-value for non-zero slope", col = "Correctly classified?") +
  theme(
    legend.position = "bottom"
  )

All of the \(k\)-mers which are exponentially increasing are classified as such. In other words, the false negative rate is zero. There are a small number of false positives: \(k\)-mers in the baseline regime which are classified as exponentially increasing. All of the values in the confusion matrix are given by:

fits_day <- fits_day %>%
  mutate(
    cm = case_when(
      regime == "Exponential" & est_regime == "Exponential" ~ "True positive",
      regime == "Exponential" & est_regime == "Baseline" ~ "False negative",
      regime == "Baseline" & est_regime == "Exponential" ~ "False positive",
      regime == "Baseline" & est_regime == "Baseline" ~ "True negative"
    )
  )

fits_day %>%
  janitor::tabyl(cm) %>%
  knitr::kable()
cm                  n    percent
False positive     28  0.0033333
True negative    7532  0.8966667
True positive     840  0.1000000
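
Equivalently, we can express these counts as rates within each true regime: for the baseline regime this is the false positive rate, and for the exponential regime it is the sensitivity (one minus the false negative rate). A short sketch using the columns defined above:

#' Proportion of k-mers classified as exponential within each true regime
fits_day %>%
  group_by(regime) %>%
  summarise(prop_est_exponential = mean(est_regime == "Exponential"))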

We highlight the few false positives in red in the following plot:

df %>%
  left_join(
    select(fits_day, id, correct, cm),
    by = "id"
  ) %>%
  ggplot(aes(x = day, y = count, col = correct, alpha = correct, group = id)) +
    geom_line() + 
    theme_minimal() +
    scale_alpha_manual(values = c(1, 0.05)) +
    scale_color_manual(values = c("#D22B2B", "grey")) +
    guides(alpha = "none") + 
    labs(x = "Day", y = "Number of reads in sample", col = "Correctly classified?")

The Brier score (see Evaluating exponential growth detection) is given by \[ \text{BS} = \frac{1}{n} \sum_{i = 1}^n \left(f_i - o_i \right)^2 \] where \(f_i = \mathbb{P} \left[ \text{regime}(i) = 1 \right]\) is the forecast probability that \(k\)-mer \(i\) is in the exponential regime and \(o_i = \text{regime}(i) \in \{0, 1\}\) is the regime it is actually in.

fits_day %>%
  mutate(
    prob = 1 - p.value,
    regime_numeric = case_when(
      regime == "Exponential" ~ 1,
      regime == "Baseline" ~ 0
    ),
    brier = (prob - regime_numeric)^2
    ) %>%
  group_by(regime) %>%
  summarise(average_brier = mean(brier))
#' Note: the average Brier score in the baseline regime is close to the
#' mean of squared standard uniform draws:
mean(runif(1000)^2)
## [1] 0.3234058
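
This is because under the baseline (null) regime the p-value is approximately uniform on \([0, 1]\), so prob = 1 - p.value is approximately uniform too, and the expected Brier contribution of a baseline \(k\)-mer is \[ \mathbb{E}[U^2] = \int_0^1 u^2 \, \mathrm{d}u = \frac{1}{3}, \] which the Monte Carlo estimate above approximates.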

With additional true abundance noise

df <- readRDS("depends/sample1.rds") %>%
  as_tibble()
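
The same fitting and assessment pipeline as for the first sample applies unchanged. A minimal sketch, assuming sample1.rds has the same id, day, count and regime columns as sample0.rds (the object name fits_noise is ours):

#' Refit the per-k-mer Poisson regressions on the noisier sample
fits_noise <- tibble(id = unique(df$id)) %>%
  mutate(
    fit = map(id, ~glm(count ~ 1 + day, family = "poisson", data = filter(df, id == .))),
    fit = map(fit, broom::tidy, conf.int = TRUE)
  ) %>%
  unnest(fit)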

Real data

counts <- read.table("data/counts-1pct-sample.tsv", header = TRUE)
subset_size <- 1000

The real data we use has 1,000,000 total \(k\)-mers, and looks like:

head(counts)

We will use a subset of size 1000, and pivot the data structure as follows:

counts_subset <- filter(counts, ec <= subset_size)

#' In a tidy format for plotting and fitting
#' (names_transform applies to columns built from the names, so the count
#' column, built from the values, needs values_transform instead)
counts_subset_long <- counts_subset %>%
  pivot_longer(
    cols = starts_with("day"),
    names_to = "day",
    names_prefix = "day",
    values_to = "count",
    names_transform = list(day = as.integer),
    values_transform = list(count = as.integer)
  )

head(counts_subset_long, n = 10)

The data is now in a form where it can be easily plotted:

ggplot(counts_subset_long, aes(x = day, y = count, group = ec)) +
  geom_line(alpha = 0.05) +
  theme_minimal() +
  labs(x = "Day", y = "Count")

#' There is also a file which I assume contains the total reads per day
row_sums <- read.table("data/daily-counts.tsv", header = TRUE)
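
If these are indeed the total reads per day, one natural refinement would be to include sequencing depth as an exposure offset in the Poisson regression, so that the slope measures growth in relative rather than absolute abundance. A sketch, not run here; the column names day and total in daily-counts.tsv are assumptions:

#' Hypothetical: adjust for sequencing depth via a log-scale offset
#' (assumes row_sums has columns `day` and `total`)
counts_with_total <- left_join(counts_subset_long, row_sums, by = "day")
glm(count ~ 1 + day + offset(log(total)),
    family = "poisson",
    data = filter(counts_with_total, ec == 1))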

Performing inference and inference assessment

fits <- tibble(ec = counts_subset$ec) %>%
  mutate(fit = map(ec, ~glm(count ~ 1 + day, family = "poisson", data = filter(counts_subset_long, ec == .)))) %>%
  mutate(fit = map(fit, broom::tidy, conf.int = TRUE)) %>%
  unnest(fit)

fits_day <- filter(fits, term == "day") %>%
  mutate(
    est_regime = case_when(
      conf.low > 0 ~ "Exponential",
      TRUE ~ "Baseline"
    )
  )

fits_day %>%
  ggplot(aes(x = ec, y = estimate, ymin = conf.low, ymax = conf.high, col = est_regime)) +
    geom_pointrange(alpha = 0.5, size = 0.5) +
    theme_minimal() +
    scale_color_manual(values = cbpalette) +
    labs(x = "k-mer ID", y = "Estimated slope", col = "Estimated regime") +
    theme(
      legend.position = "bottom"
    )

fits_day %>%
  ggplot(aes(x = ec, y = p.value, col = est_regime)) +
  geom_point() +
  scale_color_manual(values = cbpalette) +
  theme_minimal() +
  labs(x = "k-mer ID", y = "p-value for non-zero slope", col = "Estimated regime") +
  theme(
    legend.position = "bottom"
  )

saveRDS(fits_day, "fits-counts.rds")

counts_subset_long %>%
  left_join(
    select(fits_day, ec, est_regime),
    by = "ec"
  ) %>%
  ggplot(aes(x = day, y = count, col = est_regime, alpha = est_regime, group = ec)) +
    geom_line() +
    facet_grid(~est_regime) +
    scale_color_manual(values = cbpalette) +
    scale_alpha_manual(values = c(0.05, 0.5)) +
    theme_minimal() +
    guides(alpha = "none") + 
    labs(x = "Day", y = "Number of reads in sample", col = "Estimated regime")

Ideas for making the inferential model more complex

  • Hierarchical mixture (different regimes) model; see the sketch below
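
One off-the-shelf way to prototype the mixture idea is the flexmix package, which fits finite mixtures of GLMs; grouping observations by id forces all time points from a \(k\)-mer into the same latent regime. A sketch under the assumption that flexmix is available, not part of the analysis above:

#' Hypothetical: two-component mixture of Poisson regressions, grouped by k-mer
library(flexmix)
mix <- flexmix(
  count ~ 1 + day | id,
  data = df,
  k = 2,
  model = FLXMRglm(family = "poisson")
)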

Possible concerns

  • We’re assuming that an exponential increase in incidence will produce an exponential signal in the \(k\)-mer counts (which will behave approximately like prevalence)
    • It’s an open question whether we need to adjust for this (which potentially more epidemiological modelling could help with)
  • We expect the signal initially to be subexponential due to reporting delays

  1. To some extent, the number of \(k\)-mers can be varied in accordance with computational capacity.↩︎