Abstract
Background Suppose that you take a binomial sample from the average of Gaussians. How different is that to if you aggregated binomial samples from each Gaussian?
Task We simulate data from both models in a setting where \(\rho\) is low (disease prevalence) and when \(\rho \approx 0.5\) (elections) and compare.
Consider an IID latent field. Without aggregation: \[ \begin{align*} y &= \sum_{i = 1}^n y_i, \\ y_i &\sim \text{Bin}(m, \rho_i), \\ \rho_i &\sim \text{Logitnormal}(\mu, \sigma^2) \end{align*} \] With aggregation: \[ \begin{align*} y &\sim \text{Bin}(nm, \bar \rho), \\ \bar \rho &= \frac{1}{n} \sum_{i = 1}^n \rho_i, \\ \rho_i &\sim \text{Logitnormal}(\mu, \sigma^2) \end{align*} \]
experiment <- function(nsim, n, phi_mean, phi_sd, y_sample_size) {
phi <- rnorm(n, phi_mean, phi_sd)
rho <- plogis(phi)
m <- rep(y_sample_size, n)
#' Without aggregation
y <- replicate(nsim, rbinom(n, m, rho), simplify = FALSE)
y <- lapply(y, sum)
#' With aggregation (at the level of the latent field)
y_agg <- replicate(nsim, rbinom(1, sum(m), weighted.mean(rho)), simplify = FALSE)
return(list(y = unlist(y), y_agg = unlist(y_agg), phi = phi, rho = rho))
}
plot_hist <- function(result) {
data.frame(
type = c(rep("No aggregation (truth)", nsim), rep("Aggregation (approximation)", nsim)),
y = c(result$y, result$y_agg)
) %>%
ggplot() +
geom_histogram(aes(x = y, fill = type), alpha = 0.5, position = "identity") +
theme_minimal() +
theme(
legend.position = "bottom"
)
}
nsim <- 500
n <- 36
result <- experiment(nsim, n, -2.5, 0.5, 30)
plot(result$rho) #' This is what the underlying latent field looks like for these settings
plot_hist(result) #' And then the difference between the truth and the aggregated approximation
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
sd(result$y)
## [1] 9.333093
sd(result$y_agg)
## [1] 8.828432
sd(result$y_agg) / sd(result$y) #' Ratio
## [1] 0.9459277
As before but with different phi_mean
settings:
result_half <- experiment(nsim, n, 0, 0.5, 30)
plot(result_half$rho)
plot_hist(result_half)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
sd(result_half$y)
## [1] 16.40742
sd(result_half$y_agg)
## [1] 16.69478
sd(result_half$y_agg) / sd(result_half$y) #' Ratio
## [1] 1.017514
rho
setting than the
large rho
settingsd_phi
? What is the
effect of varying y_sample_size
?sessionInfo()
## R version 4.2.0 (2022-04-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Monterey 12.6
##
## Matrix products: default
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] rmarkdown_2.18 bayesplot_1.9.0 rstan_2.21.5 StanHeaders_2.21.0-7 multi.utils_0.1.0
## [6] purrr_0.3.5 sn_2.0.2 tidyr_1.2.1 gt_0.8.0 magrittr_2.0.3
## [11] cubature_2.0.4.4 tikzDevice_0.12.3.1 viridis_0.6.2 viridisLite_0.4.1 reshape2_1.4.4
## [16] dplyr_1.0.10 ggplot2_3.4.0 bsae_0.2.7 sf_1.0-9
##
## loaded via a namespace (and not attached):
## [1] matrixStats_0.62.0 numDeriv_2016.8-1.1 tools_4.2.0 bslib_0.4.1 utf8_1.2.2
## [6] R6_2.5.1 KernSmooth_2.23-20 DBI_1.1.3 colorspace_2.0-3 withr_2.5.0
## [11] sp_1.5-1 tidyselect_1.2.0 gridExtra_2.3 prettyunits_1.1.1 mnormt_2.0.2
## [16] processx_3.5.3 compiler_4.2.0 cli_3.4.1 labeling_0.4.2 sass_0.4.4
## [21] scales_1.2.1 classInt_0.4-8 ggridges_0.5.3 callr_3.7.0 proxy_0.4-27
## [26] askpass_1.1 stringr_1.5.0 digest_0.6.30 pkgconfig_2.0.3 htmltools_0.5.3
## [31] fastmap_1.1.0 highr_0.9 rlang_1.0.6 rstudioapi_0.13 jquerylib_0.1.4
## [36] generics_0.1.3 farver_2.1.1 jsonlite_1.8.4 inline_0.3.19 loo_2.5.1
## [41] Matrix_1.5-1 Rcpp_1.0.9 munsell_0.5.0 fansi_1.0.3 lifecycle_1.0.3
## [46] stringi_1.7.8 yaml_2.3.6 pkgbuild_1.3.1 plyr_1.8.8 grid_4.2.0
## [51] parallel_4.2.0 crayon_1.5.2 lattice_0.20-45 cowplot_1.1.1 INLA_22.05.07
## [56] splines_4.2.0 BH_1.78.0-0 tmvnsim_1.0-2 knitr_1.41 ps_1.7.0
## [61] pillar_1.8.1 uuid_1.1-0 codetools_0.2-18 glue_1.6.2 evaluate_0.18
## [66] RcppParallel_5.1.5 vctrs_0.5.1 ids_1.0.1 MatrixModels_0.5-0 gtable_0.3.1
## [71] openssl_2.0.5 assertthat_0.2.1 cachem_1.0.6 xfun_0.35 RcppEigen_0.3.3.9.3
## [76] e1071_1.7-12 filehash_2.4-3 class_7.3-20 tibble_3.1.8 orderly_1.4.3
## [81] units_0.8-0 ellipsis_0.3.2