Set-up

Consider an IID latent field. Without aggregation: \[ \begin{align*} y &= \sum_{i = 1}^n y_i, \\ y_i &\sim \text{Bin}(m, \rho_i), \\ \rho_i &\sim \text{Logitnormal}(\mu, \sigma^2) \end{align*} \] With aggregation: \[ \begin{align*} y &\sim \text{Bin}(nm, \bar \rho), \\ \bar \rho &= \frac{1}{n} \sum_{i = 1}^n \rho_i, \\ \rho_i &\sim \text{Logitnormal}(\mu, \sigma^2) \end{align*} \]

experiment <- function(nsim, n, phi_mean, phi_sd, y_sample_size) {
  phi <- rnorm(n, phi_mean, phi_sd)
  rho <- plogis(phi)
  m <- rep(y_sample_size, n)

  #' Without aggregation
  y <- replicate(nsim, rbinom(n, m, rho), simplify = FALSE)
  y <- lapply(y, sum)

  #' With aggregation (at the level of the latent field)
  y_agg <- replicate(nsim, rbinom(1, sum(m), weighted.mean(rho)), simplify = FALSE)

  return(list(y = unlist(y), y_agg = unlist(y_agg), phi = phi, rho = rho))
}

plot_hist <- function(result) {
  data.frame(
    type = c(rep("No aggregation (truth)", nsim), rep("Aggregation (approximation)", nsim)),
    y = c(result$y, result$y_agg)
  ) %>%
    ggplot() +
    geom_histogram(aes(x = y, fill = type), alpha = 0.5, position = "identity") +
    theme_minimal() +
    theme(
      legend.position = "bottom"
    )
}

nsim <- 500
n <- 36

Low probability setting (disease prevalence)

result <- experiment(nsim, n, -2.5, 0.5, 30)
plot(result$rho) #' This is what the underlying latent field looks like for these settings

plot_hist(result) #' And then the difference between the truth and the aggregated approximation
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

sd(result$y)
## [1] 9.333093
sd(result$y_agg)
## [1] 8.828432
sd(result$y_agg) / sd(result$y) #' Ratio
## [1] 0.9459277

Probability approximately 1/2 setting (elections)

As before but with different phi_mean settings:

result_half <- experiment(nsim, n, 0, 0.5, 30)
plot(result_half$rho)

plot_hist(result_half)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

sd(result_half$y)
## [1] 16.40742
sd(result_half$y_agg)
## [1] 16.69478
sd(result_half$y_agg) / sd(result_half$y) #' Ratio
## [1] 1.017514

Conclusions, things to add

  • Standard deviation of aggregation tends to overestimate true standard deviation,
  • Effect is larger in the small rho setting than the large rho setting
  • Should test latent field with spatial structure e.g. ICAR latent field and so on
  • What is the effect of varying sd_phi? What is the effect of varying y_sample_size?

Original computing environment

sessionInfo()
## R version 4.2.0 (2022-04-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Monterey 12.6
## 
## Matrix products: default
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] rmarkdown_2.18       bayesplot_1.9.0      rstan_2.21.5         StanHeaders_2.21.0-7 multi.utils_0.1.0   
##  [6] purrr_0.3.5          sn_2.0.2             tidyr_1.2.1          gt_0.8.0             magrittr_2.0.3      
## [11] cubature_2.0.4.4     tikzDevice_0.12.3.1  viridis_0.6.2        viridisLite_0.4.1    reshape2_1.4.4      
## [16] dplyr_1.0.10         ggplot2_3.4.0        bsae_0.2.7           sf_1.0-9            
## 
## loaded via a namespace (and not attached):
##  [1] matrixStats_0.62.0  numDeriv_2016.8-1.1 tools_4.2.0         bslib_0.4.1         utf8_1.2.2         
##  [6] R6_2.5.1            KernSmooth_2.23-20  DBI_1.1.3           colorspace_2.0-3    withr_2.5.0        
## [11] sp_1.5-1            tidyselect_1.2.0    gridExtra_2.3       prettyunits_1.1.1   mnormt_2.0.2       
## [16] processx_3.5.3      compiler_4.2.0      cli_3.4.1           labeling_0.4.2      sass_0.4.4         
## [21] scales_1.2.1        classInt_0.4-8      ggridges_0.5.3      callr_3.7.0         proxy_0.4-27       
## [26] askpass_1.1         stringr_1.5.0       digest_0.6.30       pkgconfig_2.0.3     htmltools_0.5.3    
## [31] fastmap_1.1.0       highr_0.9           rlang_1.0.6         rstudioapi_0.13     jquerylib_0.1.4    
## [36] generics_0.1.3      farver_2.1.1        jsonlite_1.8.4      inline_0.3.19       loo_2.5.1          
## [41] Matrix_1.5-1        Rcpp_1.0.9          munsell_0.5.0       fansi_1.0.3         lifecycle_1.0.3    
## [46] stringi_1.7.8       yaml_2.3.6          pkgbuild_1.3.1      plyr_1.8.8          grid_4.2.0         
## [51] parallel_4.2.0      crayon_1.5.2        lattice_0.20-45     cowplot_1.1.1       INLA_22.05.07      
## [56] splines_4.2.0       BH_1.78.0-0         tmvnsim_1.0-2       knitr_1.41          ps_1.7.0           
## [61] pillar_1.8.1        uuid_1.1-0          codetools_0.2-18    glue_1.6.2          evaluate_0.18      
## [66] RcppParallel_5.1.5  vctrs_0.5.1         ids_1.0.1           MatrixModels_0.5-0  gtable_0.3.1       
## [71] openssl_2.0.5       assertthat_0.2.1    cachem_1.0.6        xfun_0.35           RcppEigen_0.3.3.9.3
## [76] e1071_1.7-12        filehash_2.4-3      class_7.3-20        tibble_3.1.8        orderly_1.4.3      
## [81] units_0.8-0         ellipsis_0.3.2