Set-up

Consider an IID latent field. Without aggregation: \[ \begin{align*} y &= \sum_{i = 1}^n y_i, \\ y_i &\sim \text{Bin}(m, \rho_i), \\ \rho_i &\sim \text{Logitnormal}(\mu, \sigma^2) \end{align*} \] With aggregation: \[ \begin{align*} y &\sim \text{Bin}(nm, \bar \rho), \\ \bar \rho &= \frac{1}{n} \sum_{i = 1}^n \rho_i, \\ \rho_i &\sim \text{Logitnormal}(\mu, \sigma^2) \end{align*} \]

experiment <- function(nsim, n, phi_mean, phi_sd, y_sample_size) {
  phi <- rnorm(n, phi_mean, phi_sd)
  rho <- plogis(phi)
  m <- rep(y_sample_size, n)

  #' Without aggregation
  y <- replicate(nsim, rbinom(n, m, rho), simplify = FALSE)
  y <- lapply(y, sum)

  #' With aggregation (at the level of the latent field)
  y_agg <- replicate(nsim, rbinom(1, sum(m), weighted.mean(rho)), simplify = FALSE)

  return(list(y = unlist(y), y_agg = unlist(y_agg), phi = phi, rho = rho))
}

plot_hist <- function(result) {
  data.frame(
    type = c(rep("No aggregation (truth)", nsim), rep("Aggregation (approximation)", nsim)),
    y = c(result$y, result$y_agg)
  ) %>%
    ggplot() +
    geom_histogram(aes(x = y, fill = type), alpha = 0.5, position = "identity") +
    theme_minimal() +
    theme(
      legend.position = "bottom"
    )
}

nsim <- 500
n <- 36

Low probability setting (disease prevalence)

result <- experiment(nsim, n, -2.5, 0.5, 30)
plot(result$rho) #' This is what the underlying latent field looks like for these settings

plot_hist(result) #' And then the difference between the truth and the aggregated approximation

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

sd(result$y)

## [1] 9.333093

sd(result$y_agg)

## [1] 8.828432

sd(result$y_agg) / sd(result$y) #' Ratio

## [1] 0.9459277

Probability approximately 1/2 setting (elections)

As before but with different phi_mean settings:

result_half <- experiment(nsim, n, 0, 0.5, 30)
plot(result_half$rho)

plot_hist(result_half)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

sd(result_half$y)

## [1] 16.40742

sd(result_half$y_agg)

## [1] 16.69478

sd(result_half$y_agg) / sd(result_half$y) #' Ratio

## [1] 1.017514

Conclusions, things to add

Standard deviation of aggregation tends to overestimate true standard deviation,
Effect is larger in the small rho setting than the large rho setting
Should test latent field with spatial structure e.g. ICAR latent field and so on
What is the effect of varying sd_phi? What is the effect of varying y_sample_size?

Original computing environment

sessionInfo()

## R version 4.2.0 (2022-04-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Monterey 12.6
## 
## Matrix products: default
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] rmarkdown_2.18       bayesplot_1.9.0      rstan_2.21.5         StanHeaders_2.21.0-7 multi.utils_0.1.0   
##  [6] purrr_0.3.5          sn_2.0.2             tidyr_1.2.1          gt_0.8.0             magrittr_2.0.3      
## [11] cubature_2.0.4.4     tikzDevice_0.12.3.1  viridis_0.6.2        viridisLite_0.4.1    reshape2_1.4.4      
## [16] dplyr_1.0.10         ggplot2_3.4.0        bsae_0.2.7           sf_1.0-9            
## 
## loaded via a namespace (and not attached):
##  [1] matrixStats_0.62.0  numDeriv_2016.8-1.1 tools_4.2.0         bslib_0.4.1         utf8_1.2.2         
##  [6] R6_2.5.1            KernSmooth_2.23-20  DBI_1.1.3           colorspace_2.0-3    withr_2.5.0        
## [11] sp_1.5-1            tidyselect_1.2.0    gridExtra_2.3       prettyunits_1.1.1   mnormt_2.0.2       
## [16] processx_3.5.3      compiler_4.2.0      cli_3.4.1           labeling_0.4.2      sass_0.4.4         
## [21] scales_1.2.1        classInt_0.4-8      ggridges_0.5.3      callr_3.7.0         proxy_0.4-27       
## [26] askpass_1.1         stringr_1.5.0       digest_0.6.30       pkgconfig_2.0.3     htmltools_0.5.3    
## [31] fastmap_1.1.0       highr_0.9           rlang_1.0.6         rstudioapi_0.13     jquerylib_0.1.4    
## [36] generics_0.1.3      farver_2.1.1        jsonlite_1.8.4      inline_0.3.19       loo_2.5.1          
## [41] Matrix_1.5-1        Rcpp_1.0.9          munsell_0.5.0       fansi_1.0.3         lifecycle_1.0.3    
## [46] stringi_1.7.8       yaml_2.3.6          pkgbuild_1.3.1      plyr_1.8.8          grid_4.2.0         
## [51] parallel_4.2.0      crayon_1.5.2        lattice_0.20-45     cowplot_1.1.1       INLA_22.05.07      
## [56] splines_4.2.0       BH_1.78.0-0         tmvnsim_1.0-2       knitr_1.41          ps_1.7.0           
## [61] pillar_1.8.1        uuid_1.1-0          codetools_0.2-18    glue_1.6.2          evaluate_0.18      
## [66] RcppParallel_5.1.5  vctrs_0.5.1         ids_1.0.1           MatrixModels_0.5-0  gtable_0.3.1       
## [71] openssl_2.0.5       assertthat_0.2.1    cachem_1.0.6        xfun_0.35           RcppEigen_0.3.3.9.3
## [76] e1071_1.7-12        filehash_2.4-3      class_7.3-20        tibble_3.1.8        orderly_1.4.3      
## [81] units_0.8-0         ellipsis_0.3.2

Binomial of sum of Gaussians versus sum of binomial of Gaussians

Adam Howes

Set-up

Low probability setting (disease prevalence)

Probability approximately 1/2 setting (elections)

Conclusions, things to add

Original computing environment