Introduction

Background

  • Quantitative polymerase chain reaction (qPCR) is a technique used to detect a target DNA sequence and quantify its concentration in a sample
  • The target DNA sequence is amplified in cycles
  • \(C_q\) is the cycle number at which the amplification curve intersects some threshold line
  • This threshold line is typically determined automatically by the qPCR machine software
  • Many factors impact the absolute value of \(C_q\) besides the concentration of the target DNA
  • The amplification curve is determined by measuring fluorescence
  • Rn is a measure of fluorescence, calculated as the ratio of the fluorescence of the FAM (reporter) dye to that of the ROX (passive reference) dye
  • \(\Delta\)Rn is Rn minus the baseline fluorescence
  • Baseline fluorescence is also calculated automatically by the qPCR machine software
  • Artefacts from the reaction mix or instrument can change the fluorescence measurements
  • Fluorescence emission of any molecule depends on environmental factors (e.g. pH, salt concentration)

The standard curve method

  • Let \(x\) be \(\log_{10}(\text{copies per ml})\)
  • Carry out experiments at concentrations of \(10^d\) copies per ml, i.e. \(x = d\), for \(d\) ranging over a sequence of integers
  • Fit a linear regression \(C_q = \beta_0 + \beta x\), as sketched in the Stan program below
  • The key robust relationship is that changing the number of copies by a factor of ten should produce the same additive change in \(C_q\)
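
As a concrete illustration, this regression could be written as a minimal Stan program (a sketch with my own variable names; in practice there would be replicate measurements at each dilution):

// standard-curve.stan

data {
  int<lower=1> N;      // Number of standard curve measurements
  vector[N] x;         // log10(copies per ml) for each measurement
  vector[N] cq;        // Measured C_q values
}

parameters {
  real beta_0;         // Intercept
  real beta;           // Slope, approximately -1 / log10(1 + E) for efficiency E
  real<lower=0> sigma; // Measurement noise standard deviation
}

model {
  cq ~ normal(beta_0 + beta * x, sigma);
}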

Statistical model

Let’s try to model the qPCR amplification curve data.

Copy amplification

Suppose that there are initially \(n_0 \in \mathbb{N}^+\) copies of the virus and that the number of copies at cycle \(c \in \mathbb{N}^+\) of amplification is \(n_c\). Amplification typically results in a logistic curve, which may be described by parametric models. Two possible strategies for analysing this curve are (Ritz and Spiess 2008):

  1. only model the exponential component,
  2. model the complete curve.

In some sense 2. is the more satisfying option. However, the later stages of the curve, where resources start to be exhausted and growth slows, seem less related to the number of copies of the virus and more to the reaction conditions, and so are of less interest for quantification.

Models for only the exponential component

First, the exponential component must be isolated. A simple method would be to fit an exponential model and a logistic model to decreasing subsets of the data, and to find the first cycle number at which the exponential model fits better than the logistic model, according to some metric.

Once the exponential component has been isolated, it can be modelled either deterministically or stochastically.

In the deterministic approach \[ n_c = n_0 (1 + E)^c \] where \(E\) is the efficiency. The first derivative is \[ \frac{\text{d}n_c}{\text{d}c} = n_0 \log(1 + E) (1 + E)^c \]
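
This model makes the standard curve relationship explicit. If we define \(C_q\) as the (fractional) cycle at which \(n_c\) crosses a fixed threshold \(n^\star\) (notation mine), then \[ n^\star = n_0 (1 + E)^{C_q} \implies C_q = \frac{\log_{10} n^\star - \log_{10} n_0}{\log_{10}(1 + E)}, \] so a tenfold reduction in \(n_0\) increases \(C_q\) by \(1 / \log_{10}(1 + E)\) cycles (about 3.32 when \(E = 1\)), which is exactly the additive change underlying the standard curve.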

In the stochastic approach, we can use a branching process model \[ n_{c + 1} = n_c + \text{Binomial}(n_c, E) \] where here the efficiency \(E\) must be in the range \((0, 1)\). Note that if \(E \in \{0, 1\}\) then the model is deterministic. We might want to think about extending this model to work for values of \(E > 1\). I’m unsure what this would correspond to mechanistically; presumably some copies are being multiplied more than once? Might we consider some Poisson model for the amplification instead of the binomial?
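
As a sketch, the branching process is easy to simulate in Stan’s generated quantities block (the file name and inputs are mine; since there are no parameters, it would be run with the fixed-parameter sampler):

// branching-sim.stan

data {
  int<lower=1> C;           // Number of cycles to simulate
  int<lower=1> n_0;         // Initial number of copies
  real<lower=0, upper=1> E; // Efficiency: per-copy duplication probability
}

generated quantities {
  int n[C + 1]; // n[c + 1] is the number of copies after c cycles
  n[1] = n_0;
  for (c in 1:C) {
    // Each of the n[c] copies is duplicated independently with probability E
    n[c + 1] = n[c] + binomial_rng(n[c], E);
  }
}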

Models for the complete curve

So far, I am only aware of deterministic models for the complete curve. I imagine a mechanistic model might need to include parameters for many experimental conditions, and if possible we would prefer to avoid any thermodynamics. Perhaps it could work with some kind of cooling schedule on the efficiency of the reaction such that \(E \to 0\), where this trajectory is learnt.

Sticking to deterministic models, two examples are the four parameter and five parameter logistic models.

The four-parameter logistic model (4PL) is of the form \[ n_c = n_\infty + \frac{n_0 - n_\infty}{1 + (c / m)^b} \] where \(n_\infty = \lim_{c \to \infty} n_c\) is the asymptote of the curve at infinity, \(n_0 = \lim_{c \to 0} n_c\) is, in some sense, the initial number of copies, \(b\) is the slope, and \(m\) is the cycle at which the curve passes the midpoint between \(n_0\) and \(n_\infty\) (I will refer to this as “the midpoint”). The first derivative of \(n_c\) is \[ \frac{\text{d}n_c}{\text{d}c} = - (n_0 - n_{\infty}) \frac{1}{\left(1 + (c / m)^b \right)^2} b \frac{c^{b - 1}}{m^b}. \] Evaluating this at the midpoint \(m\) gives \[ \left. \frac{\text{d}n_c}{\text{d}c} \right\rvert_{c = m} = - (n_0 - n_{\infty}) \frac{b}{4m} = \frac{(n_{\infty} - n_0)}{4m} b, \] such that the derivative at the midpoint is proportional to the slope parameter \(b\).

The five-parameter logistic model (5PL) includes an additional parameter \(s\), which accounts for asymmetry of the curve \[ n_c = n_\infty + \frac{n_0 - n_{\infty}}{\left(1 + (c / m)^b \right)^s} \] Setting \(s = 1\) recovers the 4PL.
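
For reference, the 5PL can be written as a Stan function analogous to the four_parameter_logistic function used in the model files below (a sketch; it is not used elsewhere in this document):

// 5pl-function.stan

functions {
  real five_parameter_logistic(real t, real n_0, real n_inf, real m, real b, real s) {
    return(n_inf + ((n_0 - n_inf) / pow(1 + pow(t / m, b), s)));
  }
}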

Extension linking copies to fluorescence

After specifying a model for the number of copies at each cycle, we need to link it to the observed fluorescence.

We observe \((\text{Rn}_0, \ldots, \text{Rn}_C)\) where \(C\) is the total number of PCR cycles run. The number of copies \(n_c\) may be connected to the fluorescence data \(\text{Rn}_c\) by some function \(g\) such that \(\text{Rn}_c \sim g(n_c)\). It is likely sufficient to take this relationship to be linear, such that \[ \text{Rn}_c = \alpha_0 + \alpha n_c + \epsilon_c \] where \(\alpha_0\) is the background fluorescence, \(\alpha\) is the fluorescence produced by a single copy, and \(\epsilon_c\) is e.g. Gaussian noise. qPCR machines may estimate \(\alpha_0\) and provide fluorescence data with the background removed (i.e. \(\Delta\)Rn), such that the model can be simplified to \[ \Delta\text{Rn}_c = \alpha n_c + \epsilon_c \] This linear link is implemented in the model files below, where \(\alpha\) and the noise standard deviation appear as beta and sigma_f.

In particular, \(\alpha_0\) can be estimated by

  1. Selecting some initial phase of the reaction, where little amplification has occurred
  2. Fitting a constant, or a linear regression (this is what the QuantStudio 3 and 5 Real-Time PCR System Software does), to this initial phase, as in the sketch below
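
A minimal Stan sketch of the linear regression variant, assuming the first \(K\) cycles contain negligible amplification (the names K, alpha_0 and alpha_1 are mine; this mimics, rather than reproduces, the QuantStudio procedure):

// baseline.stan

data {
  int<lower=1> K; // Number of initial cycles assumed to be pre-amplification
  vector[K] rn;   // Rn measurements for cycles 1..K
}

parameters {
  real alpha_0;        // Baseline level
  real alpha_1;        // Baseline drift per cycle
  real<lower=0> sigma; // Measurement noise
}

model {
  for (k in 1:K) {
    rn[k] ~ normal(alpha_0 + alpha_1 * k, sigma);
  }
}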

Extension to stochastic cycle times

One idea for making the “deterministic” curve fitting methods more stochastic is to introduce stochastic cycle times. Suppose that cycles \(c = 1, \dots, C\) occur at times \(t_1, \ldots, t_C\) with intervals \(\kappa_c = t_{c + 1} - t_c\) between consecutive cycles. A simple model for the intervals would be \(\kappa_c \sim \mathcal{N}(\mu_\kappa, {\sigma_\kappa}^2)\). The interval mean could be fixed to one or given a prior centered at one. Taking \(t_1 = 0\), the cycle times are then the cumulative sum of the intervals, \(t_c = \sum_{d < c} \kappa_d\).
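
In Stan, this extension might look as follows (a partial sketch; the amplification curve would then be evaluated at t[c] rather than at the integer cycle number c):

// stochastic-times.stan (partial)

data {
  int<lower=2> C; // Total number of cycles
}

parameters {
  vector<lower=0>[C - 1] kappa; // Intervals between consecutive cycles
}

transformed parameters {
  // Taking t_1 = 0, cycle times are the cumulative sum of the intervals
  vector[C] t = append_row(rep_vector(0, 1), cumulative_sum(kappa));
}

model {
  kappa ~ normal(1, 0.05); // Interval mean fixed to one; the 0.05 is arbitrary
}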

If we want to stick to fitting the model in a probabilistic programming language which handles only continuous parameters, then perhaps this could be useful, but otherwise there seem to be better options.

Extension to processing

Prior to amplification, we might consider a processing step in which e.g. DNA is released from inside the phage capsid. This process should likely be modelled differently depending on the copy amplification model. Processing is more natural to include with an exponential amplification model, because that model is more mechanistic (it is difficult to model the logistic curve mechanistically).

Processing with exponential component

The 4PL and 5PL models do not have an intuitive way to incorporate a processing component, as the initial number of copies is not a (real) parameter of those models. In the exponential model, by contrast, \(n_0\) is an explicit parameter. Suppose that a proportion \(r\) of the DNA copies are successfully released during processing from an initial number \(N_0\). Then \[ n_0 \sim \text{Binomial}(N_0, r) \] where some prior could be placed on \(r \sim p(r)\). A simulation sketch is given below. Note that for inference Stan cannot sample the discrete \(n_0\) directly; it would need to be marginalised out or given a continuous relaxation (presumably the intent of the commented-out xbinomial_logit_lpdf in Model 2).
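
As with the branching process above, simulation is straightforward (a sketch with my variable names; run with the fixed-parameter sampler):

// processing-sim.stan

data {
  int<lower=1> N_0;         // Number of copies before processing
  real<lower=0, upper=1> r; // Probability a copy is successfully released
  real<lower=0, upper=1> E; // Amplification efficiency
  int<lower=1> C;           // Number of cycles to simulate
}

generated quantities {
  int n[C + 1];
  n[1] = binomial_rng(N_0, r); // Processing: binomial thinning of N_0
  for (c in 1:C) {
    n[c + 1] = n[c] + binomial_rng(n[c], E); // Exponential-phase amplification
  }
}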

Processing with complete curve

We could assume that reducing the initial copies of the virus by a factor of 10 increases the midpoint \(m\) by one unit. One possible way to do this is to set \(\log(r) \sim \mathcal{N}^{+}(0, 1)\), then alter the midpoint such that \(m \leftarrow m + \log(r)\); this is the approach taken in Model 2 below, where \(\log(r)\) appears as p_log. This approach won’t give the desired zero-inflated behaviour.

Implementation

The primary aims of our analysis are:

  1. To infer \(N_0\) directly
  2. To infer \(C_q\), allowing a comparison against the value produced by default in-built software
  3. To model multiple experiments jointly, eventually resulting in a posterior distribution over the standard curve
  4. To appropriately deal with non-Gaussian uncertainty; for example, we would like the model to capture inflated errors at high \(C_q\) values, corresponding to low initial numbers of copies

We will build and expand our model step-by-step, following a Bayesian workflow (Gelman et al. 2020), using probabilistic programming languages such as Stan (Stan Development Team 2022) or NIMBLE (de Valpine et al. 2022).

Model files

Model 1

Model 1 implements the 4PL curve with the linear fluorescence link, assuming background-corrected data (\(\Delta\)Rn, so the background term beta_0 is commented out) and fixing \(n_0 = 0\).

// 4pl.stan

functions {
  real four_parameter_logistic(real t, real n_0, real n_inf, real m, real b) {
    return(n_inf +  ((n_0 - n_inf) / (1 + pow(t / m, b))));
  }
}

data {
  int T;                                     // Total number of cycles
  int<lower=0, upper=1> flag_run_estimation; // Should the likelihood be used?
  real f[T];                                 // Observed fluorescence data
  // real<lower=0> threshold;                    // Threshold value
}

parameters {
  // 4PL
  // real<lower=0> n_0;    // Asymptote as t -> 0 (assume this is zero)
  real<lower=0> n_inf;     // Asymptote as t -> inf
  real<lower=0> m;         // Midpoint
  real<lower=0> b;         // Slope

  // Link to fluorescence
  // real<lower=0> beta_0; // Background fluorescence (assume this is removed)
  real<lower=0> beta;      // Slope (assume counts proportional to fluorescence)
  real<lower=0> sigma_f;   // Standard deviation of fluorescence
}

transformed parameters {
  vector[T] n; // Counts

  // 4PL logistic
  for(t in 1:T) {
    n[t] = four_parameter_logistic(t, 0, n_inf, m, b);
  }
}

model {
  // Prior
  // n_0 ~ normal(1, 0.5); // A plausible range for the initial amount
  n_inf ~ normal(8, 0.5);  // A plausible range for the final amount
  m ~ normal(20, 5);       // The midpoint at around 20 with high uncertainty
  b ~ normal(10, 1);       // And the slope at around 10

  // beta_0 ~ normal(0, 0.5);
  beta ~ normal(1, 0.1);     // Assume to be a 1-to-1 relationship approximately
  sigma_f ~ normal(0, 0.1); // Some amount of signal noise in fluorescence measurements

  // Likelihood (evaluated if flag_run_estimation is TRUE)
  if(flag_run_estimation == 1){
    f ~ normal(beta * n, sigma_f);
  }
}

generated quantities {
  real f_sim[T];
  f_sim = normal_rng(beta * n, sigma_f);
}

Model 2

Model 2 extends Model 1 with a processing term p_log, which shifts the prior on the midpoint \(m\) (see “Processing with complete curve” above).

// 4pl-processing.stan

functions {
  real four_parameter_logistic(real t, real n_0, real n_inf, real m, real b) {
    return(n_inf +  ((n_0 - n_inf) / (1 + pow(t / m, b))));
  }

  // real xbinomial_logit_lpdf(real y, real m, real eta) {
  //   real dens = lchoose(m, y) + y * log(inv_logit(eta)) + (m - y) * log(1 - inv_logit(eta));
  //   real jacobian = log(inv_logit(eta)) + log(inv_logit(-eta)); // Alternative: eta - 2 * log(1 + exp(eta))
  //   return dens + jacobian;
  // }
}

data {
  int T;                                     // Total number of cycles
  int<lower=0, upper=1> flag_run_estimation; // Should the likelihood be used?
  real f[T];                                 // Observed fluorescence data
}

parameters {
  // Processing
  real<lower=0> p_log;     // log(r): midpoint shift due to processing (see text)

  // 4PL
  // real<lower=0> n_0;    // Asymptote as t -> 0 (assume this is zero)
  real<lower=0> n_inf;     // Asymptote as t -> inf
  real<lower=0> m;         // Midpoint
  real<lower=0> b;         // Slope

  // Link to fluorescence
  // real<lower=0> beta_0; // Background fluorescence (assume this is removed)
  real<lower=0> beta;      // Slope (assume counts proportional to fluorescence)
  real<lower=0> sigma_f;   // Standard deviation of fluorescence
}

transformed parameters {
  vector[T] n; // Counts
  for(t in 1:T) {
    n[t] = four_parameter_logistic(t, 0, n_inf, m, b);
  }
}

model {
  // Prior

  // Processing
  p_log ~ normal(0, 1);

  // 4PL logistic
  n_inf ~ normal(10, 0.5); // Give a plausible range for the final amount
  m ~ normal(20 + p_log, 5); // Midpoint around 20, shifted up by processing
  b ~ normal(10, 1);         // And the slope at around 10

  // Link to fluoresence
  // beta_0 ~ normal(0, 0.5);
  beta ~ normal(1, 0.1);
  sigma_f ~ normal(0, 0.25);

  // Likelihood (evaluated if flag_run_estimation is TRUE)

  if(flag_run_estimation == 1){
    f ~ normal(beta * n, sigma_f);
  }
}

generated quantities {
  real f_sim[T];
  f_sim = normal_rng(beta * n, sigma_f);
}

Bibliography

de Valpine, Perry, Christopher Paciorek, Daniel Turek, Nick Michaud, Cliff Anderson-Bergman, Fritz Obermeyer, Claudia Wehrhahn Cortes, Abel Rodríguez, Duncan Temple Lang, and Sally Paganin. 2022. NIMBLE User Manual (version 0.12.2). https://doi.org/10.5281/zenodo.1211190.
Gelman, Andrew, Aki Vehtari, Daniel Simpson, Charles C Margossian, Bob Carpenter, Yuling Yao, Lauren Kennedy, Jonah Gabry, Paul-Christian Bürkner, and Martin Modrák. 2020. “Bayesian Workflow.” arXiv Preprint arXiv:2011.01808.
Ritz, Christian, and Andrej-Nikolai Spiess. 2008. “qpcR: An r Package for Sigmoidal Model Selection in Quantitative Real-Time Polymerase Chain Reaction Analysis.” Bioinformatics 24 (13): 1549–51.
Stan Development Team. 2022. “RStan: The R Interface to Stan.” https://mc-stan.org/.