This notebook describes how to run power analyses for DNAm arrays on user-defined datasets with the pwrEWAS method. The original pwrEWAS method is available as a Bioconductor package. There was need to make the original method extensible to new user-provided datasets, and this vignette describes how to do this with a lightly modified power analysis function, pwrEWAS_itable().

Source the revised function, pwrEWAS_itable()

Source the revised function from GitHub with source_url(). This runs the script pwrEWAS_revised.R, producing a series of callable functions in the active R session.

revised_function_url <- paste0("https://github.com/metamaden/pwrEWAS/", "blob/master/inst/revised_functions/pwrEWAS_revised.R?raw=TRUE")
devtools::source_url(revised_function_url)

Generate DNAm summary statistics

pwrEWAS requires a set of DNAm summary statistics, specifically a table with DNAm means (e.g. a column named “mu”) and variances (e.g. a column named “var”) to inform power analysis simulations. For this example, get DNAm summary statistics from minfiData’s example array data, stored as a MethylSet, then use rowMeans() and rowVars() to compute summaries from Beta-values.

library(minfiData)
data("MsetEx") # load MethylSet
ms <- MsetEx 
bval <- getBeta(ms) # get DNAm fractions
# get the summary data frame
dfpwr <- data.frame(mu = rowMeans(bval, na.rm = T),
                    var = rowVars(bval, na.rm = T)) 
head(dfpwr)

The particular samples used to generate the CpG probe summary statistics above can be important. Samples from a specific tissue and/or demographic may yield more relevant information from power analysis for a given experiment design task.

Run pwrEWAS_itable()

There are numerous parameters for fine-tuning power analyses. For the demonstration runs below, set the following parameter values.

ttype <- dfpwr               # tissueType
mintss <- 10                 # minTotSampleSize
maxtss <- 1000               # maxTotSampleSize
sstep <- 100                 # SampleSizeSteps
tdeltav <- c(0.05, 0.1, 0.2) # targetDelta
dmethod <- "limma"           # DMmethod
fdr <- 0.05                  # FDRcritVal
nsim <- 20                   # sims
j <- 1000                    # J
ndmp <- 50                   # targetDmCpGs
detlim <- 0.01               # detectionLimit
maxctau <- 100               # maxCnt.tau
ncntper <- 0.5               # NcntPer

These effectively define the power analysis as a series of tests varying samples from a minimum of 10, to a maximum of 230, at intervals of 20 samples, for a total of 12 total max sample sizes tested with evenly distributed experiment groups. For instance, at 10 total samples each experiment group has 5 samples, etc.

Further, simulations use the limma() function for differential methylation test, with 50 significant probes expected at an FDR significance of 0.05 from among 5000 total simulated probes. Mean DNAm differences between experiment groups are varied across 3 possible values, either 0.05, 0.1, or 0.2. Finally, each set of test parameters will be simulated 20 times.

Run power simulations with 2 cores

Setting the method parameters as above, run pwrEWAS on multiple cores by setting the argument core to some value >1. But first set the seed to ensure run reproducibility.

set.seed(0)
lpwr.c2 <- pwrEWAS_itable(core = 2, 
                          tissueType = ttype, minTotSampleSize = mintss, 
                          maxTotSampleSize = maxtss, SampleSizeSteps = sstep, 
                          NcntPer = ncntper, targetDelta = tdeltav, J = j, 
                          targetDmCpGs = ndmp, detectionLimit = detlim, 
                          DMmethod = dmethod, FDRcritVal = fdr, 
                          sims = nsim, maxCnt.tau = maxctau)
# [2022-02-17 13:44:51] Finding tau...done [2022-02-17 13:45:06]
# [1] "The following taus were chosen: 0.013671875, 0.02734375, 0.0546875"
# [2022-02-17 13:45:06] Running simulation
# [2022-02-17 13:45:06] Running simulation ... done [2022-02-17 13:48:23]

The commented status messages show the example run time was about 3:30.

Access the power analysis results

Power analysis results are returned in a list of four objects called "meanPower" (a matrix), "powerArray" (an array), "deltaArray" (a list), and "metric", (also a list).

The first object, meanPower, shows the mean power (cell values) by total sample size (y-axis, rows, from 10 to 230) and delta DNAm difference (x-axis, columns) across simulations. The dimensions and first rows of this table are shown below.

lpwr <- lpwr.c1           # get results from an above example
mp <- lpwr[["meanPower"]] # get the mean power table
dim(mp) # get the dimensions of mean power table
head(mp) # get first rows of mean power table

The second object is powerArray, an array of matrices containing 720 data points. This data is used to calculate the meanPower summaries, which can be seen by comparing mean of the first 20 powerArray values (e.g. the 20 simulations where total samples is 10 and delta is 0.05) to the meanPower cell [1,1] corresponding to delta = 0.1, total samples = 10.

pa <- lpwr$powerArray # get power array
length(pa) # get length of power array
## [1] 600
mean(pa[1:20]) == mp[1,1] # compare means, power array vs. mean power table
## [1] TRUE

The final objects show various observed values for the delta DNAm (in the deltaArray object), and the marginal type I error, classical power, FDR, FDC, and true positive probability (in the metric object).

Plot smooths with errors using ggplot2

This section shows how to use the ggplot2 package to generate publication-ready plot summaries of pwrEWAS power analysis results.

To plot the full simulation results with smooths and standard errors, reformat the array of matrices in the powerArray object. Extract the power values according to the dimensions of our simulations (e.g. 10 sample sizes times 10 simulations times 2 deltas = 200 total simulations). Finally, coerce and harmonize power values across deltas to form a tall data frame for plotting.

# extract power values from the array of matrices
parr <- pa
m1 <- data.frame(power = parr[1:200])   # first delta power values
m2 <- data.frame(power = parr[201:400]) # second delta power values
m3 <- data.frame(power = parr[401:600]) # third delta power values
# assign total samples to power values
m1$total.samples <- m2$total.samples <- 
  m3$total.samples <- rep(seq(10, 910, 100), each = 20)
# add delta labels
m1$`Delta\nDNAm` <- rep("0.05", 200)
m2$`Delta\nDNAm` <- rep("0.1", 200)
m3$`Delta\nDNAm` <- rep("0.2", 200)
# make the tall data frame for plotting
dfp <- rbind(m1, rbind(m2, m3))

Make the final plot using goem_smooth(), which uses method=loess here by default. You can specify other methods with the method argument (see ?geom_smooth for details). Again, the horizontal line at 80% power is included for reference.

ggplot(dfp, aes(x = total.samples, y = power, color = `Delta\nDNAm`)) +
  geom_smooth(se = T, method = "loess") + theme_bw() + xlab("Total samples") + 
  ylab("Power") + geom_hline(yintercept = 0.8, color = "black", linetype = "dotted")

Including the standard errors lends some confidence to our findings. First, we can tell there is a great deal of separation between each of the delta models throughout all simulations except at the very lowest total sample sizes. Further, at the highest total sample size the power exceeds 80% with high confidence at delta = 0.2 (e.g. lowest standard error well above reference line), with less confidence where delta = 0.1 (e.g. lowest standard error touches reference line), and not at all where delta = 0.05 (e.g. highest standard error is well below reference line).

Next steps and further reading

This vignette showed how to used a lightly modified implementation of pwrEWAS on a custom user-provided DNAm dataset. It showed how to generate DNAm summary statistics, use these in the pwrEWAS_itable() function, identify simulation and summary outcomes in returned results data, and plot simulation results with errors using ggplot2.

Note the values of arguments like J, nsim, and maxtss can be increased in practice to yield a more robust power model. The values of arguments including FDRcritVal and tdeltav can further be set according to the empirical results of perliminary analyses or literature review to yield more informative results.

More details about the pwrEWAS method, including more fuction parameter details, can be found in the function docstrings (e.g. check ?pwrEWAS), descriptions in the Bioconductor package documentation and in the main study, (Graw et al. 2019)(https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2804-7).

Session Info

utils::sessionInfo()
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] HDF5Array_1.22.1            DelayedArray_0.20.0        
##  [3] Matrix_1.4-0                limma_3.50.1               
##  [5] gridExtra_2.3               ggplot2_3.3.5              
##  [7] knitr_1.37                  recountmethylation_1.4.5   
##  [9] minfi_1.40.0                bumphunter_1.36.0          
## [11] locfit_1.5-9.5              iterators_1.0.14           
## [13] foreach_1.5.2               Biostrings_2.62.0          
## [15] XVector_0.34.0              SummarizedExperiment_1.24.0
## [17] Biobase_2.54.0              MatrixGenerics_1.6.0       
## [19] matrixStats_0.61.0          GenomicRanges_1.46.1       
## [21] GenomeInfoDb_1.30.1         IRanges_2.28.0             
## [23] S4Vectors_0.32.3            BiocGenerics_0.40.0        
## [25] rhdf5_2.38.1                BiocStyle_2.22.0           
## 
## loaded via a namespace (and not attached):
##   [1] BiocFileCache_2.2.1       plyr_1.8.6               
##   [3] splines_4.1.2             BiocParallel_1.28.3      
##   [5] digest_0.6.29             htmltools_0.5.2          
##   [7] magick_2.7.3              fansi_1.0.2              
##   [9] magrittr_2.0.2            memoise_2.0.1            
##  [11] tzdb_0.2.0                readr_2.1.2              
##  [13] annotate_1.72.0           askpass_1.1              
##  [15] siggenes_1.68.0           prettyunits_1.1.1        
##  [17] colorspace_2.0-3          blob_1.2.2               
##  [19] rappdirs_0.3.3            xfun_0.30                
##  [21] dplyr_1.0.8               crayon_1.5.0             
##  [23] RCurl_1.98-1.6            jsonlite_1.8.0           
##  [25] genefilter_1.76.0         GEOquery_2.62.2          
##  [27] survival_3.3-1            glue_1.6.2               
##  [29] gtable_0.3.0              zlibbioc_1.40.0          
##  [31] Rhdf5lib_1.16.0           scales_1.1.1             
##  [33] DBI_1.1.2                 rngtools_1.5.2           
##  [35] Rcpp_1.0.8.2              xtable_1.8-4             
##  [37] progress_1.2.2            bit_4.0.4                
##  [39] mclust_5.4.9              preprocessCore_1.56.0    
##  [41] httr_1.4.2                RColorBrewer_1.1-2       
##  [43] ellipsis_0.3.2            farver_2.1.0             
##  [45] pkgconfig_2.0.3           reshape_0.8.8            
##  [47] XML_3.99-0.9              sass_0.4.0               
##  [49] dbplyr_2.1.1              utf8_1.2.2               
##  [51] labeling_0.4.2            tidyselect_1.1.2         
##  [53] rlang_1.0.2               AnnotationDbi_1.56.2     
##  [55] munsell_0.5.0             tools_4.1.2              
##  [57] cachem_1.0.6              cli_3.2.0                
##  [59] generics_0.1.2            RSQLite_2.2.10           
##  [61] evaluate_0.15             stringr_1.4.0            
##  [63] fastmap_1.1.0             yaml_2.3.5               
##  [65] bit64_4.0.5               beanplot_1.2             
##  [67] scrime_1.3.5              purrr_0.3.4              
##  [69] KEGGREST_1.34.0           nlme_3.1-155             
##  [71] doRNG_1.8.2               sparseMatrixStats_1.6.0  
##  [73] nor1mix_1.3-0             xml2_1.3.3               
##  [75] biomaRt_2.50.3            compiler_4.1.2           
##  [77] filelock_1.0.2            curl_4.3.2               
##  [79] png_0.1-7                 tibble_3.1.6             
##  [81] bslib_0.3.1               stringi_1.7.6            
##  [83] highr_0.9                 GenomicFeatures_1.46.5   
##  [85] lattice_0.20-45           multtest_2.50.0          
##  [87] vctrs_0.3.8               pillar_1.7.0             
##  [89] lifecycle_1.0.1           rhdf5filters_1.6.0       
##  [91] BiocManager_1.30.16       jquerylib_0.1.4          
##  [93] data.table_1.14.2         bitops_1.0-7             
##  [95] rtracklayer_1.54.0        R6_2.5.1                 
##  [97] BiocIO_1.4.0              bookdown_0.24            
##  [99] codetools_0.2-18          MASS_7.3-55              
## [101] assertthat_0.2.1          openssl_2.0.0            
## [103] rjson_0.2.21              withr_2.5.0              
## [105] GenomicAlignments_1.30.0  Rsamtools_2.10.0         
## [107] GenomeInfoDbData_1.2.7    mgcv_1.8-39              
## [109] hms_1.1.1                 quadprog_1.5-8           
## [111] grid_4.1.2                tidyr_1.2.0              
## [113] base64_2.0                rmarkdown_2.13           
## [115] DelayedMatrixStats_1.16.0 illuminaio_0.36.0        
## [117] restfulr_0.0.13

Works Cited

Graw, Stefan, Rosalyn Henn, Jeffrey A. Thompson, and Devin C. Koestler. 2019. “pwrEWAS: A User-Friendly Tool for Comprehensive Power Estimation for Epigenome Wide Association Studies (EWAS).” BMC Bioinformatics 20 (1): 218. https://doi.org/10.1186/s12859-019-2804-7.