Abstract
The purpose of this document is to specify a generative model for qPCR data which captures a realistic data generating process sufficiently well to outperform the “standard curve” simple linear regression approach. We anticipate that most of the performance benefit will come when there are few initial virus copies, close to the limit of detection (LOD).
Let’s try to model qPCR amplification curve data.
Suppose that there are initially \(n_0 \in \mathbb{N}^+\) copies of the virus and the number of copies during amplification at cycle \(c \in \mathbb{N}^+\) is \(n_c\). Amplification typically results in a logistic curve which may be described by parametric models. Two possible strategies for analysing this curve are (Ritz and Spiess 2008):

1. isolate the exponential phase of the curve and model only that component, or
2. model the complete curve, including the later plateau phase.
In some sense, 2. is the more satisfying option, though the later stages of the curve, where resources start to be exhausted and growth slows, seem less related to the copies of the virus and more to the reaction conditions, and so are of less interest for quantification.
First, the exponential component must be isolated. A simple method to do this would be to fit an exponential model and logistic model to decreasing subsets of the data, and find the first cycle number for which the exponential model fits better, according to some metric, than the logistic model.
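A minimal sketch of this procedure in Python, assuming background-subtracted fluorescence, Gaussian errors, and an RSS-based AIC as the comparison metric (the function names and starting values are illustrative, not taken from any particular package):

```python
import numpy as np
from scipy.optimize import curve_fit

def exponential(c, n0, E):
    # Deterministic exponential amplification: n_c = n_0 * (1 + E)^c
    return n0 * (1.0 + E) ** c

def logistic_4pl(c, n0, n_inf, m, b):
    # Four-parameter logistic curve
    return n_inf + (n0 - n_inf) / (1.0 + (c / m) ** b)

def aic(rss, n_obs, k):
    # AIC under Gaussian errors, computed from the residual sum of squares
    return n_obs * np.log(rss / n_obs) + 2 * k

def exponential_cutoff(cycles, f, min_points=5):
    """Fit both models to decreasing subsets of the data and return the
    largest cutoff cycle at which the exponential model fits better."""
    for end in range(len(cycles), min_points - 1, -1):
        c, y = cycles[:end], f[:end]
        try:
            p_exp, _ = curve_fit(exponential, c, y,
                                 p0=[max(y[0], 1e-3), 0.9],
                                 bounds=([0.0, 0.0], [np.inf, 10.0]))
            p_log, _ = curve_fit(logistic_4pl, c, y,
                                 p0=[max(y[0], 1e-3), y[-1], np.median(c), 5.0],
                                 maxfev=10000)
        except (RuntimeError, ValueError):  # a fit failed; try a smaller subset
            continue
        rss_exp = np.sum((y - exponential(c, *p_exp)) ** 2)
        rss_log = np.sum((y - logistic_4pl(c, *p_log)) ** 2)
        if aic(rss_exp, end, k=2) < aic(rss_log, end, k=4):
            return cycles[end - 1]
    return None  # the exponential model never fit better
```

Any other goodness-of-fit metric could be substituted for the AIC here.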
Given the exponential component is isolated, it can either be modelled deterministically or stochastically.
In the deterministic approach \[ n_c = n_0 (1 + E)^c \] where \(E\) is the efficiency. The first derivative is \[ \frac{\text{d}n_c}{\text{d}c} = n_0 \log(1 + E) (1 + E)^c \]
In the stochastic approach, we can use a branching process model \[ n_{c + 1} = n_c + \text{Binomial}(n_c, E) \] where here the efficiency \(E\) must be in the range \((0, 1)\). Note that if \(E \in \{0, 1\}\) then the model is deterministic. We might want to think about extending this model to work for values of \(E > 1\). I’m unsure what this would correspond to mechanistically – presumably some copies are being multiplied more than once? Might we consider some Poisson model for the amplification instead of binomial?
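A short simulation of this branching process (a sketch; `branching_pcr` is our name) illustrates how variable the trajectories are when \(n_0\) is small:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def branching_pcr(n0, E, cycles):
    # n_{c+1} = n_c + Binomial(n_c, E): each copy duplicates with probability E
    n = np.empty(cycles + 1, dtype=np.int64)
    n[0] = n0
    for c in range(cycles):
        n[c + 1] = n[c] + rng.binomial(n[c], E)
    return n

# Replicate runs from the same five initial copies diverge noticeably
print([branching_pcr(n0=5, E=0.9, cycles=25)[-1] for _ in range(3)])
```

In expectation this reproduces the deterministic model, since \(\mathbb{E}[n_{c+1} \mid n_c] = n_c (1 + E)\).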
So far, I am only aware of deterministic models for the complete curve. I imagine a mechanistic model might need to include parameters for many experimental conditions, and if possible we would prefer to avoid any thermodynamics. Perhaps it could work with some kind of cooling schedule on the efficiency of the reaction such that \(E \to 0\), where this trajectory is learnt.
Sticking to deterministic models, two examples are the four parameter and five parameter logistic models.
The four-parameter logistic model (4PL) is of the form \[ n_c = n_\infty + \frac{n_0 - n_\infty}{\left(1 + (c / m)^b \right)} \] where \(n_\infty = \lim_{c \to \infty} n_c\) is the asymptote of the curve at infinity, \(n_0 = \lim_{c \to 0} n_c\) is, in some sense, the initial number of copies, \(b\) is the slope, and \(m\) is the cycle at which the curve passes the midpoint between \(n_0\) and \(n_\infty\) (I will refer to this as “the midpoint”). The first derivative of \(n_c\) is \[ \frac{\text{d}n_c}{\text{d}c} = - (n_0 - n_{\infty}) \frac{1}{\left(1 + (c / m)^b \right)^2} b \frac{c^{b - 1}}{m^b}. \] Evaluating this at the midpoint, \(m\), gives \[ \frac{\text{d}n_c}{\text{d}c} \rvert_{c = m} = - (n_0 - n_{\infty}) \frac{b}{4m} = \frac{(n_{\infty} - n_0)}{4m} b, \] such that the derivative at the midpoint is proportional to the slope parameter \(b\).
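As a quick numerical check of this identity (the parameter values are illustrative, chosen to match the priors used later):

```python
import numpy as np

def four_pl(c, n0, n_inf, m, b):
    return n_inf + (n0 - n_inf) / (1.0 + (c / m) ** b)

n0, n_inf, m, b = 0.0, 8.0, 20.0, 10.0
eps = 1e-6  # central finite difference step
numeric = (four_pl(m + eps, n0, n_inf, m, b)
           - four_pl(m - eps, n0, n_inf, m, b)) / (2 * eps)
analytic = (n_inf - n0) * b / (4 * m)
assert np.isclose(numeric, analytic)  # both equal 1.0 here
```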
The five-parameter logistic model (5PL) includes an additional parameter \(s\) which accounts for asymmetry of the curve \[ n_c = n_\infty + \frac{n_0 - n_{\infty}}{\left(1 + (c / m)^b \right)^s}. \] Note that when \(s \neq 1\), the curve at \(c = m\) takes the value \(n_\infty + (n_0 - n_\infty) / 2^s\), so \(m\) is no longer exactly the midpoint.
After specifying a model for the number of copies at each cycle, we now need to link it to the observed fluorescence.
We observe \((\text{Rn}_0, \ldots, \text{Rn}_C)\) where \(C\) is the total number of PCR cycles run. The number of copies \(n_c\) may be connected to the fluorescence data \(\text{Rn}_c\) by some function \(g\) such that \(\text{Rn}_c \sim g(n_c)\). It is likely sufficient to consider this relationship to be linear, such that \[ \text{Rn}_c = \alpha_0 + \alpha n_c + \epsilon_c \] where \(\alpha_0\) is the background fluorescence, \(\alpha\) is the fluorescence produced by a single copy, and \(\epsilon_c\) is e.g. Gaussian noise. qPCR machines may estimate \(\alpha_0\) and provide fluorescence data with the background removed (i.e. \(\Delta\text{Rn}\)) such that the model simplifies to \[ \Delta\text{Rn}_c = \alpha n_c + \epsilon_c. \]
In particular, \(\alpha_0\) can be estimated by averaging the fluorescence over the first few baseline cycles, before any amplification signal is detectable.
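For example, a minimal simulation of background-subtracted fluorescence under this linear link (the values of \(\alpha\) and \(\sigma\), and the 4PL parameters, are illustrative):

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def four_pl(c, n0, n_inf, m, b):
    return n_inf + (n0 - n_inf) / (1.0 + (c / m) ** b)

cycles = np.arange(1, 41)
n = four_pl(cycles, n0=0.0, n_inf=8.0, m=20.0, b=10.0)  # copy curve
alpha, sigma = 1.0, 0.1                 # fluorescence per copy; noise sd
delta_rn = alpha * n + rng.normal(0.0, sigma, size=cycles.size)
```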
One idea for making the “deterministic” curve fitting methods more stochastic is to introduce stochastic cycle times. Suppose that cycles \(c = 1, \dots, C\) occur at times \(t_1, \ldots, t_C\) with intervals \(\kappa_c = t_{c + 1} - t_c\) between cycles. A simple model for the intervals would be \(\kappa_c \sim \mathcal{N}(\mu_\kappa, {\sigma_\kappa}^2)\). The interval mean could be fixed to one or given a prior centered at one. Cycle times are then the cumulative sum of the intervals, \(t_c = t_1 + \sum_{d < c} \kappa_d\).
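A sketch of this in Python, taking \(t_1 = 1\) so that times align with cycle numbers and picking an illustrative value for \(\sigma_\kappa\):

```python
import numpy as np

rng = np.random.default_rng(seed=1)

C = 40
kappa = rng.normal(loc=1.0, scale=0.02, size=C - 1)  # intervals, mean fixed to one
t = 1.0 + np.concatenate(([0.0], np.cumsum(kappa)))  # t_c = t_1 + sum_{d < c} kappa_d
# A deterministic curve such as the 4PL can then be evaluated at t
# rather than at the integer cycle numbers c = 1, ..., C.
```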
If we want to stick to fitting the model in a probabilistic programming language which handles only continuous latent variables, then perhaps this could be useful, but otherwise it seems like there are better options.
Prior to amplification, we might consider a processing step where e.g. DNA is released from inside the phage capsid. This process should likely be modelled differently depending on the copy amplification model. Processing is more natural to include with an exponential amplification model because it is more mechanistic (it’s difficult to model the logistic curve mechanistically).
The 4PL and 5PL models do not have an intuitive way to incorporate a processing component, as the initial number of copies is not a (real) parameter of the model. As a workaround, suppose that a proportion \(r\) of the DNA copies are successfully released during processing from an initial number \(N_0\). Then \[ n_0 \sim \text{Binomial}(N_0, r) \] where some prior could be placed on \(r \sim p(r)\).
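A draw from this processing model, which also shows where zero inflation enters when \(N_0\) is small (the numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(seed=1)

N0, r = 10, 0.7            # initial copies and release proportion (illustrative)
n0 = rng.binomial(N0, r)   # copies surviving processing
p_zero = (1 - r) ** N0     # P(n0 = 0): every copy fails to release
print(n0, p_zero)
```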
We could assume that reducing the initial copies of the virus by a factor of 10 increases the midpoint \(m\) by one unit. One possible way to do this is to set \(-\log(r) \sim \mathcal{N}^{+}(0, 1)\), then alter the midpoint such that \(m \leftarrow m - \log(r)\) (since \(\log(r) < 0\) for \(r \in (0, 1)\), this always increases the midpoint). This approach won’t give the desired zero-inflated behaviour.
The primary aims of our analysis are:

1. to estimate the initial number of virus copies, together with its uncertainty, from an observed amplification curve; and
2. to do so more accurately than the standard curve approach, particularly when the initial number of copies is small and close to the LOD.
We will build and expand our model step-by-step using probabilistic programming languages such as Stan (Stan Development Team 2022) or NIMBLE (de Valpine et al. 2022).
We start by considering a model with the 4PL logistic copy amplification component linearly linked to fluorescence data. We follow a Bayesian workflow (Gelman et al. 2020) by specifying a prior, generating data from that prior, then fitting the model to the generated data. This may be done using a single Stan file by including a parameter `flag_run_estimation`. The Stan code for this model is given below.
// 4pl.stan
functions {
  real four_parameter_logistic(real t, real n_0, real n_inf, real m, real b) {
    return(n_inf + ((n_0 - n_inf) / (1 + pow(t / m, b))));
  }
}

data {
  int T;                                      // Total number of cycles
  int<lower=0, upper=1> flag_run_estimation;  // Should the likelihood be used?
  real f[T];                                  // Observed fluorescence data
  // real<lower=0> threshold;                 // Threshold value
}

parameters {
  // 4PL
  // real<lower=0> n_0;  // Asymptote as t -> 0 (assume this is zero)
  real<lower=0> n_inf;   // Asymptote as t -> inf
  real<lower=0> m;       // Midpoint
  real<lower=0> b;       // Slope

  // Link to fluorescence
  // real<lower=0> beta_0;  // Background fluorescence (assume this is removed)
  real<lower=0> beta;      // Slope (assume counts proportional to fluorescence)
  real<lower=0> sigma_f;   // Standard deviation of fluorescence
}

transformed parameters {
  vector[T] n;  // Counts

  // 4PL logistic
  for(t in 1:T) {
    n[t] = four_parameter_logistic(t, 0, n_inf, m, b);
  }
}

model {
  // Prior
  // n_0 ~ normal(1, 0.5);  // A plausible range for the initial amount
  n_inf ~ normal(8, 0.5);   // A plausible range for the final amount
  m ~ normal(20, 5);        // The midpoint at around 20 with high uncertainty
  b ~ normal(10, 1);        // And the slope at around 10
  // beta_0 ~ normal(0, 0.5);
  beta ~ normal(1, 0.1);    // Assume an approximately 1-to-1 relationship
  sigma_f ~ normal(0, 0.1); // Some amount of signal noise in fluorescence measurements

  // Likelihood (evaluated if flag_run_estimation is TRUE)
  if(flag_run_estimation == 1){
    f ~ normal(beta * n, sigma_f);
  }
}

generated quantities {
  real f_sim[T];
  f_sim = normal_rng(beta * n, sigma_f);
}
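A sketch of the simulate-then-fit workflow using cmdstanpy (one of several Stan interfaces; the file name, cycle count, and sampler settings here are our choices):

```python
import numpy as np
from cmdstanpy import CmdStanModel

model = CmdStanModel(stan_file="4pl.stan")
T = 40

# 1. Draw from the prior predictive distribution (likelihood switched off);
#    f must still be supplied as data, so pass zeros.
prior = model.sample(
    data={"T": T, "flag_run_estimation": 0, "f": np.zeros(T)},
    chains=1, iter_sampling=500, seed=1,
)
f_sim = prior.stan_variable("f_sim")[0]  # one simulated fluorescence curve

# 2. Fit the model to the simulated data (likelihood switched on).
posterior = model.sample(
    data={"T": T, "flag_run_estimation": 1, "f": f_sim},
    chains=4, seed=1,
)
print(posterior.summary())
```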
The Stan code for the model extended with the processing component is given below.

// 4pl-processing.stan
functions {
  real four_parameter_logistic(real t, real n_0, real n_inf, real m, real b) {
    return(n_inf + ((n_0 - n_inf) / (1 + pow(t / m, b))));
  }

  // real xbinomial_logit_lpdf(real y, real m, real eta) {
  //   real dens = lchoose(m, y) + y * log(inv_logit(eta)) + (m - y) * log(1 - inv_logit(eta));
  //   real jacobian = log(inv_logit(eta)) + log(inv_logit(-eta)); // Alternative: eta - 2 * log(1 + exp(eta))
  //   return dens + jacobian;
  // }
}

data {
  int T;                                      // Total number of cycles
  int<lower=0, upper=1> flag_run_estimation;  // Should the likelihood be used?
  real f[T];                                  // Observed fluorescence data
}

parameters {
  // Processing
  real<lower=0> p_log;  // -log(r), where r is the proportion successfully released

  // 4PL
  // real<lower=0> n_0;  // Asymptote as t -> 0 (assume this is zero)
  real<lower=0> n_inf;   // Asymptote as t -> inf
  real<lower=0> m;       // Midpoint
  real<lower=0> b;       // Slope

  // Link to fluorescence
  // real<lower=0> beta_0;  // Background fluorescence (assume this is removed)
  real<lower=0> beta;      // Slope (assume counts proportional to fluorescence)
  real<lower=0> sigma_f;   // Standard deviation of fluorescence
}

transformed parameters {
  vector[T] n;  // Counts

  for(t in 1:T) {
    n[t] = four_parameter_logistic(t, 0, n_inf, m, b);
  }
}

model {
  // Prior
  // Processing
  p_log ~ normal(0, 1);

  // 4PL logistic
  n_inf ~ normal(10, 0.5);  // A plausible range for the final amount
  m ~ normal(20 + p_log, 5);
  b ~ normal(10, 1);

  // Link to fluorescence
  // beta_0 ~ normal(0, 0.5);
  beta ~ normal(1, 0.1);
  sigma_f ~ normal(0, 0.25);

  // Likelihood (evaluated if flag_run_estimation is TRUE)
  if(flag_run_estimation == 1){
    f ~ normal(beta * n, sigma_f);
  }
}

generated quantities {
  real f_sim[T];
  f_sim = normal_rng(beta * n, sigma_f);
}