Statistical analysis of metagenomics data

Link: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6459172/pdf/gi-2019-17-1-e6.pdf

  • Compositional data is of the form \(x = (x_1, \ldots, x_k)\) such that \(x_i > 0\) and \(\sum_i x_i = T\)
  • The total \(T\) is defined not to be of importance
  • Equivalence to some member of the unit simplex \(\mathcal{S} = \{x = (x_1, \ldots, x_k); \quad x_i > 0; \quad \sum_i x_i = 1\}\)
  • Desirata for CoDA
    • Permutation invariance: changing the order of the parts doesn’t matter
    • Scale invariance: any member of the equivalence class should have the same results
    • Sub-compositional coherence: results for a subset should be coherent with results for the whole composition
  • Log-ratio \(\log(x_i / x_j)\)
  • Log-contrast \(\sum_i a_i \log(x_i); \quad \sum_i a_i = 0\)
  • Additive log-transformation \(\text{alr}(x) = (\log(x_i / x_k))_{i < k}\)
  • Centered log-transformation \(\text{clr}(x) = (\log(x_i / g(x)))_i\) where \(g(x) = (\prod_i x_i)^{1 / k}\) is the geometric mean
  • Isometric log-transformation (not described)
  • Microbiome abundance table is a matrix \(X\) with \(n\) rows (samples) and \(k\) columns (taxa) where each entry \(x_ij\) provides the number of reads corresponding to taxon \(j\) in sample \(i\)
  • Other elements which could be available include the sample data, the taxonomy table, and the phylogenetic tree
  • Alpha diversity: within sample diversity
  • Richness \(R_\text{obs}\) gives the number of species

Visualizing the Multinomial in the Simplex

Link: http://www.statsathome.com/2017/09/14/visualizing-the-multinomial-in-the-simplex/

  • The multinomial is a discretisation of the simplex
  • The true value is \((.37, .37, .3)\) but there is no way to estimate that when the sample size is \(n = 10\)
library(tidyverse)
library(driver)
library(combinat)
## 
## Attaching package: 'combinat'
## The following object is masked from 'package:utils':
## 
##     combn
library(ggtern)
## Registered S3 methods overwritten by 'ggtern':
##   method           from   
##   grid.draw.ggplot ggplot2
##   plot.ggplot      ggplot2
##   print.ggplot     ggplot2
## --
## Remember to cite, run citation(package = 'ggtern') for further info.
## --
## 
## Attaching package: 'ggtern'
## The following objects are masked from 'package:ggplot2':
## 
##     aes, annotate, ggplot, ggplot_build, ggplot_gtable, ggplotGrob, ggsave, layer_data, theme_bw, theme_classic,
##     theme_dark, theme_gray, theme_light, theme_linedraw, theme_minimal, theme_void
plot_ternary_multinomial_density <- function(n, prob){
  sx <- t(xsimplex(p = 3, n = n))
  dx <- apply(sx, 1, function(x) dmultinom(x, size = n, prob = prob))
  sx <- cbind(as.data.frame(miniclo(sx)), dx)
  colnames(sx) <- c("v1", "v2", "v3", "d")
  
  s <- ggtern(sx, aes(x = v1, y = v2, z = v3)) +
    geom_point(alpha = 0.4, aes(size = d/max(d))) +
    theme_bw() +
    guides(size = FALSE) +
    scale_size(range = c(.1, 2))
  s
}
prob <- miniclo(c(.37, .37, .3))
d <- setNames(as.data.frame(prob), c("v1", "v2", "v3"))
plot_ternary_multinomial_density(10, prob) +
  geom_point(data = d, color = "red", size = 2, fill = "red")
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> = "none")` instead.

* As \(n\) gets larger it’s easier to estimate \(p\)

d <- setNames(as.data.frame(prob), c("v1", "v2", "v3"))
plot_ternary_multinomial_density(100, prob) +
  geom_point(data = d, color = "red", size = 2, fill = "red")
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> = "none")` instead.

* As \(p\) gets smaller, it’s harder to estimate

prob <- miniclo(c(.9, .091, .005))
p1 <- plot_ternary_multinomial_density(100, prob) +
  geom_point(data=setNames(as.data.frame(prob), c("v1", "v2", "v3")), color = "red", size = 2, fill="red")
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> = "none")` instead.
p2 <- p1 + theme_zoom_L(0.2)

grid.arrange(p1, p2, ncol=2)

We can do better than the ALR or Softmax Transform

Link: http://www.statsathome.com/2017/08/09/we-can-do-better-than-the-alr-or-softmax-transform/

  • The ALR transform corresponds to an oblique coordinate system in the simplex (i.e. it is not an orthogonal basis)
library(compositions)
## Welcome to compositions, a package for compositional data analysis.
## Find an intro with "? compositions"
## 
## Attaching package: 'compositions'
## The following objects are masked from 'package:driver':
## 
##     alr, alrInv, clr, clrInv, ilr, ilrInv
## The following objects are masked from 'package:stats':
## 
##     anova, cor, cov, dist, var
## The following objects are masked from 'package:base':
## 
##     %*%, norm, scale, scale.default
t <- seq(0, 2 * pi, by = 0.1)
x <- cos(t)
y <- sin(t)
circ <- cbind(x, y)

par(mfrow = c(1, 2))
plot(circ, asp = 1)
plot(alr(alrInv(circ, 2), 3), asp = 1)

coords <- as.matrix(rbind(
  cbind(0, seq(-1, 1, by = 0.1)), 
  cbind(seq(-1, 1, by = 0.1), 0))
)

par(mfrow=c(1, 2))
plot(acomp(alrInv(coords, 2))) 
plot(acomp(alrInv(coords, 3)))