Abstract
**Background** Rigorously benchmarking methods for exponential growth detection requires metagenomic datasets where the ground truth is known. Simulation is a flexible tool which could meet this need by enabling generation of datasets from known data-generating processes. Thinking through how one might realistically simulate data is furthermore a useful step in building understanding of the real data.
**Task** We aim to specify a hierarchical model for generation of metagenomic time series data, under (1) a baseline behaviour regime, and (2) an exponential growth regime. We expect the model to include modules for the behaviour of species under each regime, the relationship between species and \(k\)-mers, and the metagenomic sequencing observation process.
**Findings** This work is ongoing, and we believe that the task as specified is still of value. We preliminarily find that specifying an appropriate model for baseline species behaviour requires some thought, and in particular that the lognormal geometric random walk produces some undesirable features in the data.
**Next steps** The next steps are to (1) build an understanding of and incorporate models for sequencing, (2) think of simple ways to prevent the issue observed with random walks, and (3) save some simulated datasets ready to try inference.
n_species <- 10
baseline_indices <- 2:n_species
exponential_indices <- 1
n_bp <- 100
k <- 40
n_kmer <- n_bp - k
time_window <- 14
baseline <- 100
exp_initial <- 1
r <- 0.4
read_depth <- 10^4
Suppose that there are 10 species in total in the sample, and nothing else. Each species has a genome that is exactly 100 base-pairs long. Each genome is broken into \(k\)-mers of length 40. We take there to be 60 \(k\)-mers from each species, obtained by subtracting the \(k\)-mer length from the genome size (strictly, a sliding window over a 100 bp sequence gives \(100 - 40 + 1 = 61\) \(k\)-mers, but we use \(100 - 40 = 60\) for simplicity). Assume that each \(k\)-mer is unique, and there are no sequencing errors. We collect data over a period of 14 days \(t = 0, 1, \ldots, 13\) at a single environmental monitoring site.
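As a quick illustration of the \(k\)-mer decomposition, here is a minimal sketch that generates a random 100 bp genome and extracts its length-40 \(k\)-mers, following the \(n_{bp} - k\) counting convention used above (the `genome` and `kmers` objects are illustrative and not used elsewhere):
# Illustrative only: build a random genome and extract its k-mers using the
# n_bp - k convention adopted in this document.
genome <- paste(sample(c("A", "C", "G", "T"), n_bp, replace = TRUE), collapse = "")
kmers <- vapply(1:n_kmer, function(i) substr(genome, i, i + k - 1), character(1))
length(kmers)  # 60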
We consider two regimes: (1) baseline and (2) exponential growth.
In the baseline regime, a species has a concentration \(c^{(b)}\) in water of 100 copies per \(\mu\)L[1], independent of the day. \[ c^{(b)}_t = c^{(b)} \] Assuming that all of the species in the baseline regime have the same constant concentration is unrealistic, and will make detection of exponential growth substantially easier.
In the exponential growth regime, a species starts at a concentration \(c^{(e)}_0\), and over the 14-day window its concentration increases exponentially according to \[ c^{(e)}_t = c^{(e)}_0 \exp(rt) \] where \(r\) is the growth rate, which we take to be 0.4, and \(c^{(e)}_0\) is the initial concentration, which we take to be 1, lower than the baseline concentration as we assume the exponentially increasing species to be novel.
The true concentration of the baseline species over the time window is:
conc_baseline <- rep(baseline, time_window)
conc_baseline
## [1] 100 100 100 100 100 100 100 100 100 100 100 100 100 100
And the true concentration of the exponentially increasing species over the time window is:
conc_exponential <- exp_initial * exp(r * 0:(time_window - 1))
conc_exponential
## [1] 1.000000 1.491825 2.225541 3.320117 4.953032 7.389056 11.023176 16.444647 24.532530 36.598234 54.598150 81.450869 121.510418 181.272242
Suppose that species 2, 3, 4, 5, 6, 7, 8, 9, 10 are in the baseline regime and species 1 is in the exponential regime, such that 10% of species are in the exponential regime[2]. We will represent the true concentrations of each species with a matrix `C`:
C <- matrix(
data = NA,
nrow = time_window,
ncol = n_species
)
C[, exponential_indices] <- conc_exponential
C[, baseline_indices] <- conc_baseline
C
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 1.000000 100 100 100 100 100 100 100 100 100
## [2,] 1.491825 100 100 100 100 100 100 100 100 100
## [3,] 2.225541 100 100 100 100 100 100 100 100 100
## [4,] 3.320117 100 100 100 100 100 100 100 100 100
## [5,] 4.953032 100 100 100 100 100 100 100 100 100
## [6,] 7.389056 100 100 100 100 100 100 100 100 100
## [7,] 11.023176 100 100 100 100 100 100 100 100 100
## [8,] 16.444647 100 100 100 100 100 100 100 100 100
## [9,] 24.532530 100 100 100 100 100 100 100 100 100
## [10,] 36.598234 100 100 100 100 100 100 100 100 100
## [11,] 54.598150 100 100 100 100 100 100 100 100 100
## [12,] 81.450869 100 100 100 100 100 100 100 100 100
## [13,] 121.510418 100 100 100 100 100 100 100 100 100
## [14,] 181.272242 100 100 100 100 100 100 100 100 100
We assume that the true concentration of each \(k\)-mer is that of the corresponding species. This may be represented by copying each column of the matrix `C` 60 times, once for each \(k\)-mer:
K <- matrix(rep(as.numeric(t(C)), each = n_kmer), nrow = time_window, byrow = TRUE)
The matrix `K` now has 14 rows, one for each day, and 600 columns, one for each \(k\)-mer (of which there are \(60 \times 10\)). We can calculate proportions, where the total number of \(k\)-mers is given by the row sums:
K_norm <- t(apply(K, 1, function(x) x / sum(x), simplify = TRUE))
# useful::topleft(K_norm)
# useful::bottomleft(K_norm)
One way to represent the sequencing process is as a sample from the collection of \(k\)-mers. For example, we could consider a multinomial sample with probabilities of sampling each \(k\)-mer given by `K_norm` and sample size given by the read depth of \(10^4\).
To demonstrate this, suppose we simulate the sequencing process on day 1. The proportions of each \(k\)-mer are given by `K_norm[1, ]`, and we may sample using `rmultinom`. A histogram of the \(k\)-mer counts at day 1 under each regime, showing that the exponential regime is initialised at low count numbers, is given by:
sample_one <- rmultinom(1, read_depth, K_norm[1, ])
get_regime <- function(species_index) {
if(species_index %in% baseline_indices) return("baseline")
if(species_index %in% exponential_indices) return("exponential")
else return(NA)
}
#' Testing that this function works as intended
stopifnot(purrr::map_chr(1:2, get_regime) == c("exponential", "baseline"))
data.frame(count = sample_one) %>%
mutate(
species = rep(1:n_species, each = n_kmer),
regime = str_to_title(purrr::map_chr(species, get_regime)),
) %>%
ggplot(aes(x = count, group = regime, fill = regime)) +
geom_histogram(alpha = 0.8) +
labs(x = "k-mer count", y = "Occurances", fill = "Regime", title = "k-mer counts at day 1") +
scale_fill_manual(values = cbpalette) +
theme_minimal()
Now, we will take multinomial samples from all of the days with a call to `apply`:
sample <- apply(K_norm, 1, function(row) rmultinom(1, read_depth, row))
colnames(sample) <- paste0("day", 1:ncol(sample))
rownames(sample) <- paste0(1:nrow(sample))
sample_df <- sample %>%
as.data.frame() %>%
tibble::rownames_to_column("id") %>%
pivot_longer(
cols = starts_with("day"),
names_to = "day",
values_to = "count",
names_prefix = "day"
) %>%
mutate(
id = as.numeric(id),
day = as.numeric(day),
kmer = rep(rep(1:n_kmer, each = time_window), times = n_species),
species = rep(1:n_species, each = n_kmer * time_window),
regime = str_to_title(purrr::map_chr(species, get_regime))
)
saveRDS(sample_df, "sample0.rds")
The data frame `sample_df`, filtered to the first \(k\)-mer from the first species, is given by:
Let’s plot the data from species 1 which we have set to be exponentially increasing, together with the data from all other species which we have set to follow a baseline distribution:
reads_plot <- function(df) {
sample_summary <- df %>%
group_by(day, regime) %>%
summarise(
count_upper = quantile(count, 0.95),
count_median = median(count),
count_lower = quantile(count, 0.05)
)
ggplot(sample_summary, aes(x = day, ymin = count_lower, y = count_median, ymax = count_upper, group = regime)) +
geom_ribbon(alpha = 0.1, aes(fill = regime)) +
geom_line(aes(col = regime), size = 1.5) +
geom_line(data = df, aes(x = day, y = count, col = regime, group = id),
alpha = 0.05, inherit.aes = FALSE) +
theme_minimal() +
scale_color_manual(values = cbpalette) +
scale_fill_manual(values = cbpalette) +
guides(fill = "none") +
labs(x = "Day", y = "Number of reads in sample", col = "Regime")
}
reads_plot(sample_df)
sample_summary <- sample_df %>%
group_by(day, regime) %>%
summarise(
count_upper = quantile(count, 0.95),
count_median = median(count),
count_lower = quantile(count, 0.05)
)
pickout_day <- 10
With these settings, on day 10 the median \(k\)-mer count is 18 for a baseline species and 6 for species 1, and \(k\)-mers from a single baseline species represent around 10.8% of the total reads while those from species 1 represent around 3.6%. We expect the reads of the \(k\)-mers that we want to detect as exponentially increasing to be a small fraction of the total reads, which is not well captured by this simulation.
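These sampled figures can be checked against the deterministic concentrations; a rough analytical check (the multinomial draws will fluctuate around these values) is:
# Expected quantities on day 10 from the deterministic concentrations:
# species 1's share of reads and the expected count per species-1 k-mer.
total_conc <- conc_exponential[pickout_day] + (n_species - 1) * baseline
conc_exponential[pickout_day] / total_conc                        # ~0.039 of reads from species 1
read_depth * conc_exponential[pickout_day] / total_conc / n_kmer  # ~6.5 reads per species-1 k-mer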
Now, we will start to make this simulation more realistic by adding sources of noise.
In reality, the species abundances will vary, neither fixed to a constant nor increasing deterministically. We will now suppose that the abundances vary stochastically as follows.
For the baseline concentration, one possible model is a lognormal geometric random walk such that \[ c^{(b)}_t = c^{(b)}_{t - 1} \times \exp \left( \epsilon^{(b)}_t \right) \] where \(\epsilon^{(b)}_t \sim \mathcal{N}(0, \sigma_b^2)\). Another way to calculate the baseline concentration at time \(t\), \(c^{(b)}_t\), is as \(c^{(b)}_t = c^{(b)}_{0} \times \exp \left( x^{(b)}_t \right)\) where \(x^{(b)}_t = \sum_{s = 1}^t \epsilon^{(b)}_s\) follows a random walk. The expected value and variance of this random walk are \[ \mathbb{E}[x^{(b)}_t] = \mathbb{E} \left[ \sum_{s = 1}^t \epsilon^{(b)}_s \right] = \sum_{s = 1}^t \mathbb{E} \left[ \epsilon^{(b)}_s \right] = 0, \\ \mathbb{Var} [ x^{(b)}_t ] = \mathbb{Var} \left[ \sum_{s = 1}^t \epsilon^{(b)}_s \right] = \sum_{s = 1}^t \mathbb{Var} \left[ \epsilon^{(b)}_s \right] = t\sigma_b^2. \] We can use the formula for the moment generating function of a Gaussian random variable \(X \sim \mathcal{N}(\mu, \sigma^2)\), \(m_X(u) = \mathbb{E} \left[ e^{uX} \right] = \exp(\mu u + \sigma^2u^2 / 2)\), to calculate that the expected concentration at time \(t\) under this model is \[ \mathbb{E} \left[ c^{(b)}_t \right] = c^{(b)}_{0} \exp(t \sigma_b^2/2) > c^{(b)}_{0}. \]
For the exponential regime, we could also suppose a lognormal geometric random walk \[ c^{(e)}_t = c^{(e)}_{t - 1} \times \exp \left( \epsilon^{(e)}_t \right) \] where \(\epsilon^{(e)}_t \sim \mathcal{N}(r, \sigma_e^2)\) and the growth rate \(r > 0\). By the same argument as above, the expected concentration at time \(t\) under this model is \(\mathbb{E} \left[ c^{(e)}_t \right] = c^{(e)}_{0} \exp(rt + t\sigma_e^2/2)\).
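As a sanity check on these expectations (not part of the main simulation), a quick Monte Carlo estimate with \(\sigma_b = \sigma_e = 1\), matching the unit-variance innovations used in the code below, could look like:
# Monte Carlo check of E[c_t] under both random-walk models at t = 5,
# with sigma_b = sigma_e = 1 (the rnorm defaults used below).
n_sim <- 10^5
t_check <- 5
x_b <- rowSums(matrix(rnorm(n_sim * t_check), nrow = n_sim))
x_e <- rowSums(matrix(rnorm(n_sim * t_check, mean = r), nrow = n_sim))
mean(baseline * exp(x_b))     # roughly baseline * exp(t_check / 2)
mean(exp_initial * exp(x_e))  # roughly exp_initial * exp(r * t_check + t_check / 2)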
n_baseline_paths <- 10
To show how this behaviour looks, let’s simulate 10 sample paths from each of the baseline and exponential regimes:
baseline_df <- data.frame(
day = rep(1:time_window, times = n_baseline_paths),
epsilon = rnorm(time_window * n_baseline_paths),
id = as.factor(rep(1:n_baseline_paths, each = time_window)),
regime = "Baseline"
) %>%
group_by(id) %>%
arrange(id) %>%
mutate(
x = cumsum(epsilon),
conc = baseline * exp(x)
) %>%
ungroup()
exponential_df <- data.frame(
day = rep(1:time_window, times = n_baseline_paths),
epsilon = rnorm(time_window * n_baseline_paths, mean = r),
id = as.factor(rep(1:n_baseline_paths, each = time_window)),
regime = "Exponential"
) %>%
group_by(id) %>%
arrange(id) %>%
mutate(
x = cumsum(epsilon),
conc = baseline * exp(x)
) %>%
ungroup()
df <- bind_rows(exponential_df, baseline_df)
epsilon_x_conc_plot <- function(df, title) {
df %>%
pivot_longer(
cols = c("epsilon", "x", "conc"),
names_to = "variable",
values_to = "value"
) %>%
mutate(
variable = recode_factor(variable,
"epislon" = "epsilon",
"x" = "x",
"conc" = "Concentration")
) %>%
ggplot(aes(x = day, y = value, group = id, col = regime)) +
geom_line(alpha = 0.5) +
facet_wrap(variable ~ regime, scales = "free", ncol = 2) +
scale_colour_manual(values = cbpalette) +
theme_minimal() +
guides(col = "none") +
labs(x = "Day", y = "", title = title)
}
epsilon_x_conc_plot(df, "Lognormal geometric random walk")
The key takeaway about the lognormal geometric random walk model is that even though the noise is IID and Gaussian, taking the cumulative sum and exponentiating can lead to large deviations in concentration, which follows from the variance of a random walk increasing linearly in time. The maximum concentration value observed under the baseline regime was approximately \(2.36 \times 10^4\), around 236 times the baseline concentration. It may be helpful to have some idea of the maximum sample concentration which is possible, and to impose this as a (soft) constraint somehow.
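The figure quoted above can be recomputed from the simulated paths (the exact value depends on the random draws in `df`):
# Largest baseline excursion across the simulated paths, as a multiple of the
# baseline concentration (values will differ from run to run).
df %>%
filter(regime == "Baseline") %>%
summarise(max_conc = max(conc), ratio = max(conc) / baseline)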
rho <- 0.75
To remedy the issue observed above, we could use an auto-regressive process for \(x^{(b)}_t\) rather than a random walk. Such a model is specified recursively by \[ x^{(b)}_1 \sim \mathcal{N} \left( 0, \frac{1}{1 - \rho^2} \right), \\ x^{(b)}_t = \rho x^{(b)}_{t - 1} + \epsilon^{(b)}_t, \quad t = 2, \ldots, T, \] where the lag-one correlation parameter \(\rho \in (0, 1)\) determines the extent of autocorrelation, which we take to be 0.75, and \(\epsilon^{(b)}_t \sim \mathcal{N}(0, 1)\) as before.
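Written out directly (as an alternative to the `arima.sim` call used below), the recursion could be simulated as follows; `simulate_ar1` is a hypothetical helper, not used elsewhere in this document:
# Direct simulation of the AR(1) recursion above, initialised from its
# stationary distribution N(0, 1 / (1 - rho^2)).
simulate_ar1 <- function(time_window, rho) {
x <- numeric(time_window)
x[1] <- rnorm(1, mean = 0, sd = sqrt(1 / (1 - rho^2)))
for (t in 2:time_window) {
x[t] <- rho * x[t - 1] + rnorm(1)
}
x
}
conc_ar1_example <- baseline * exp(simulate_ar1(time_window, rho))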
baseline_ar_df <- data.frame(
day = rep(1:time_window, times = n_baseline_paths),
epsilon = rnorm(time_window * n_baseline_paths),
id = as.factor(rep(1:n_baseline_paths, each = time_window)),
regime = "Baseline"
) %>%
group_by(id) %>%
arrange(id) %>%
mutate(
x = stats::arima.sim(model = list(order = c(1, 0, 0), ar = rho), n = time_window, innov = epsilon),
conc = baseline * exp(x)
) %>%
ungroup()
df_ar <- bind_rows(exponential_df, baseline_ar_df)
epsilon_x_conc_plot(df_ar, "Lognormal geometric auto-regressive behaviour for the baseline")
We will now simulate and plot data for this model (lognormal geometric auto-regressive for the baseline, and lognormal geometric random walk[3] for the exponential) with 10 species in total, each with 60 \(k\)-mers as before.
C <- matrix(
data = NA,
nrow = time_window,
ncol = n_species
)
C[, baseline_indices] <- df_ar %>%
filter(
regime == "Baseline",
id %in% baseline_indices
) %>%
select(day, id, conc) %>%
pivot_wider(
names_from = id,
values_from = conc
) %>%
select(-day) %>%
as.matrix()
C[, exponential_indices] <- df_ar %>%
filter(
regime == "Exponential",
id %in% exponential_indices
) %>%
select(day, id, conc) %>%
pivot_wider(
names_from = id,
values_from = conc
) %>%
select(-day) %>%
as.matrix()
C
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 51.86324 455.75900 43.85268 116.107174 26.865151 374.9348 340.291050 340.29880 431.645931 68.458586
## [2,] 112.64032 258.33368 162.91927 24.688979 54.574773 478.7057 260.008248 95.59991 313.861965 90.536196
## [3,] 131.24413 94.53869 69.23040 10.395658 27.865854 160.9867 112.799723 334.87112 123.512439 41.072013
## [4,] 102.22304 62.14032 124.15064 9.557252 21.505399 157.6298 62.570679 233.05346 43.292227 79.902801
## [5,] 82.38903 21.43774 158.61014 24.777646 34.319199 366.1655 69.247598 131.09531 9.671660 12.382247
## [6,] 10.14592 73.08090 537.58526 51.818714 20.311840 314.7365 40.073057 294.45362 7.373497 11.025189
## [7,] 15.10906 30.77254 751.02182 53.024307 20.308443 153.6723 190.820019 203.64471 3.095794 77.887591
## [8,] 21.50460 15.41494 143.60707 20.739193 84.055043 115.7762 20.975074 69.78449 18.620751 79.126759
## [9,] 24.70046 28.81690 431.75927 31.621192 44.185929 175.6399 42.941601 87.10092 32.440686 91.576668
## [10,] 393.63606 109.17488 92.04359 184.199699 52.877005 112.5956 28.303524 20.59793 11.686943 82.620189
## [11,] 3252.57183 483.68560 136.65551 714.702291 87.566839 469.3173 50.929730 73.56324 18.427043 82.490380
## [12,] 5927.83682 184.32047 34.46052 671.494642 87.066625 812.8341 36.484363 82.74719 18.352883 40.029919
## [13,] 8856.64870 116.35453 242.23226 1808.408739 15.326954 7326.3577 18.389014 27.77650 21.526956 11.490165
## [14,] 48002.89841 34.78237 239.94070 255.784237 4.529088 2635.8823 8.763516 44.64028 9.149630 8.465828
#' Bring together code from above
sample_kmers <- function(C, n_kmer, read_depth) {
time_window <- nrow(C)
n_species <- ncol(C)
K <- matrix(rep(as.numeric(t(C)), each = n_kmer), nrow = time_window, byrow = TRUE)
K_norm <- t(apply(K, 1, function(x) x / sum(x), simplify = TRUE))
sample <- apply(K_norm, 1, function(row) rmultinom(1, read_depth, row))
colnames(sample) <- paste0("day", 1:ncol(sample))
rownames(sample) <- paste0(1:nrow(sample))
sample %>%
as.data.frame() %>%
tibble::rownames_to_column("id") %>%
pivot_longer(
cols = starts_with("day"),
names_to = "day",
values_to = "count",
names_prefix = "day"
) %>%
mutate(
id = as.numeric(id),
day = as.numeric(day),
kmer = rep(rep(1:n_kmer, each = time_window), times = n_species),
species = rep(1:n_species, each = n_kmer * time_window),
regime = str_to_title(purrr::map_chr(species, get_regime))
)
}
sample_df <- sample_kmers(C, n_kmer, read_depth)
saveRDS(sample_df, "sample1.rds")
reads_plot(sample_df)
### Ideas for further extensions
To further this research direction, we could consider more sophisticated time-series models for the baseline and exponential regimes. This might include structural time-series models with seasonal and local trends. Doing so would help to highlight a problem that any NAO is going to encounter: there will naturally be periods of exponential growth in the baseline regime. It would be useful to understand (1) how often this might occur, and (2) whether there are strategies to avoid erroneously flagging this behaviour, and whether pursuing them would be a good idea.
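As one small illustration of this direction, a baseline log-concentration with an added seasonal term might be sketched as follows; the weekly period and amplitude of 0.5 are illustrative choices, not values taken from data:
# Sketch: AR(1) baseline log-concentration plus a weekly sinusoidal seasonal
# component (amplitude and period are illustrative).
seasonal_baseline <- data.frame(day = 1:time_window) %>%
mutate(
seasonal = 0.5 * sin(2 * pi * day / 7),
x = as.numeric(stats::arima.sim(model = list(ar = rho), n = time_window)),
conc = baseline * exp(seasonal + x)
)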
See Townes (2020) for a nice review of models for count data.
Rather than assuming that during sequencing the probabilities of sampling each \(k\)-mer are given by the true proportions of the \(k\)-mer, we might want to add additional noise. One way to do this is to sample the proportions \(w\) from a probability distribution over the simplex, such as the Dirichlet[4]. Given a vector of positive real numbers \(\alpha = (\alpha_1, \ldots, \alpha_K)\) then \(w \sim \text{Dirichlet}(\alpha_1, \ldots, \alpha_K)\) if \[ p(w; \alpha) = \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} w_1^{\alpha_1 - 1} w_2^{\alpha_2 - 1} \times \cdots \times w_K^{\alpha_K - 1} \] The expected value of \(w_i\) is \(\alpha_i / \sum_{k = 1}^K \alpha_k\).
dirichlet_draw <- gtools::rdirichlet(n = 1, alpha = K_norm[1, ])
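Note that the \(\alpha\) above sums to one, which gives an extremely noisy draw. A common refinement (a sketch, with an illustrative precision value rather than one taken from data) is to scale the expected proportions by a precision parameter and then draw reads from the resulting proportions, giving a Dirichlet-multinomial observation model:
# Dirichlet-multinomial sketch: larger precision values keep the sampled
# proportions closer to K_norm[1, ]; precision = 500 is illustrative.
precision <- 500
w <- gtools::rdirichlet(n = 1, alpha = precision * K_norm[1, ])
noisy_sample_one <- rmultinom(1, read_depth, as.vector(w))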
How could we go about making the simulations more realistic? The realism of a given model can be divided into structural realism and parametric realism[5]. Structural realism refers to the distributional assumptions made. Parametric realism refers to the specific (hyper-)parameter values chosen for the assumed distributions.
Realism can be assessed by (1) simulating data from the generative model we propose, (2) plotting and summarising the simulated data, possibly using dimension reduction techniques, and (3) comparing to real data, or to our intuitions and expectations of it. The final stage requires developing a good understanding of the real data, or, even better, closely collaborating with domain scientists who have this understanding. Aside from allowing us to create more accurate simulations, developing a good understanding of the data we hope to model is a worthy goal in itself. I’m imagining an iterative process where we simulate data, ask scientists if it looks good, write down what they say, try to improve it, and so on.
One method for improving parametric realism is to obtain real data, then fit a model to the real data to obtain posteriors over parameters of interest. We could then use posterior predictive samples drawn from the model as our simulations. In this setting, we may want to widen the posterior somewhat, as the sample of real data we obtain is unlikely to be representative of all real data we might encounter.
Structural realism may be improved by noticing stylistic features of the real data which the simulated data does not possess. These features might include shapes of distributions, tail heaviness, skewness, sources and structure of variation, sources and structure of biases, presence of outliers or multi-modalities etc.
Another way to improve simulation realism is to consider each module of the simulation separately, using relevant technical replicates for calibration. If these technical replicates come from the literature they may not be directly relevant, but they could provide a good starting place. If there are specific parts of the model where in-house experiments might be especially informative for improving the simulation, we could consider that approach too.
For further reading on this, I would suggest the preprint (and upcoming book) Bayesian workflow by Gelman et al.
[1] The units don’t matter much.
[2] In reality, we expect a small proportion of species to be in the exponential regime. Exactly what proportion this would be remains to be determined.
[3] Draws from a Gaussian with mean growth rate greater than zero (exponential regime) can be quite varied, so it might be worth thinking about changing this too.
[4] Note that Mike has a hunch that the Dirichlet-multinomial may not be as good as other possible models.
[5] This distinction is somewhat arbitrary, in that there may be no clear line between structural and parametric realism, but it may still be useful. For example, some distributions are general enough to admit other distributions as special cases when certain parameter values are chosen: the Gaussian might (perhaps unhelpfully) be thought of as a limiting case of the Student-t as the degrees of freedom tends to infinity.