modified: Wed Mar 1 12:56:03 2017 compiled: Mon Oct 30 20:43:05 2017

1 Introduction

bacon can be used to remove inflation and bias often observed in epigenome- and transcriptome-wide association studies (Iterson, Zwet, and Heijmans 2017).

To this end bacon constructs an empirical null distribution using a Gibbs Sampling algorithm by fitting a three-component normal mixture on z-scores. One component is forced, using prior knowledge, to represent the null distribution with mean and standard deviation representing the bias and inflation. The other two components are necessary to capture the amount of true associations present in the data, which we assume unknown but small.

bacon provides functionality to inspect the output of the Gibbs Sampling algorithm, i.e., plots of traces, posterior distributions and the mixture fit, are provided. Furthermore, inflation- and bias-corrected test-statistics or P-values are extracted easily. In addition, functionality for performing fixed-effect meta-analysis are provided as well.

The function bacon requires a vector or a matrix of z-scores, e.g., those extracted from association analyses using a linear regression approach. For fixed-effect meta-analysis a matrix of effect-sizes and standard-errors is required.

This vignette illustrates the use of bacon using simulated z-scores, effect-sizes and standard errors to avoid long run-times. If multiple sets of test-statisics or effect-sizes and standard-errors are provided, the Gibbs Sampler algorithm can be executed in parallel to reduce computation time using functionality provide by BiocParallel-package.

2 A single set of test-statistics

A vector containing \(5000\) z-scores is generated from a normal mixture distribution, \(90\%\) of the z-scores were drawn from a biased and inflated null distribution, \(\mathcal{N}(0.2, 1.3)\), and the remaining z-scores from \(\mathcal{N}(\mu, 1)\), where \(\mu \sim \mathcal{N}(4, 1)\). The rnormmix-function provided by Bacon generates a vector of random test-statistics described above optionally with different parameters.

y <- rnormmix(5000, c(0.9, 0.2, 1.3, 1, 4, 1))

The function bacon executes the Gibbs Sampler algorithm and stores all in- and out-put in an object of class Bacon. Several accessor-functions are available to access data contained in the Bacon-object, e.g. for obtaining the estimated parameters of the mixture fit or explicitly the bias and inflation. Actually, the latter two are the mean and standard deviation of the null component (mu.0 and sigma.0).

bc <- bacon(y)
bc
## Bacon-object containing 1 set(s) of 5000 test-statistics.
## ...estimated bias: 0.15.
## ...estimated inflation: 1.3.
## 
## Empirical null estimates are based on 5000 iterations with a burnin-period of 2000.
estimates(bc)
##        p.0    p.1    p.2  mu.0 mu.1  mu.2 sigma.0 sigma.1 sigma.2
## [1,] 0.902 0.0694 0.0286 0.148 3.13 -2.98     1.3    2.95    2.51
inflation(bc)
## sigma.0 
##     1.3
bias(bc)
##  mu.0 
## 0.148

Several methods are provided to inspect the output of the Gibbs Sampler algorithm, such as traces-plots of all estimates, plots of posterior distributions, provide as a scatter plot between two parameters, and the actual fit of the three component mixture to the histogram of z-scores.

traces(bc, burnin=FALSE)
Plot of Gibbs Sampling traces. Each panel represent of one the estimated parameters. Default plot shows the burin-in period as well.

Figure 1: Plot of Gibbs Sampling traces
Each panel represent of one the estimated parameters. Default plot shows the burin-in period as well.

posteriors(bc)
Gibbs Sampling posterior distributions of two estimated parameters the inflation (sigma 0) and proportion of null features (pi0 0). Posterior plots of the other parameters can be generated by using the `thetas` argument. The ellipical curves corresponding to a 75%, 90% and 95% probability regions for a bivariate normal distribution with mean and covariance estimated form the scatter-plot.

Figure 2: Gibbs Sampling posterior distributions of two estimated parameters the inflation (sigma 0) and proportion of null features (pi0 0)
Posterior plots of the other parameters can be generated by using the thetas argument. The ellipical curves corresponding to a 75%, 90% and 95% probability regions for a bivariate normal distribution with mean and covariance estimated form the scatter-plot.

fit(bc, n=100)
Fit to the data as estimated using the Gibbs Sampling algorithm. Black line represent to overall fit, red the fit of the null distribution and blue and green the alternatives.

Figure 3: Fit to the data as estimated using the Gibbs Sampling algorithm
Black line represent to overall fit, red the fit of the null distribution and blue and green the alternatives.

The previous three plots can be use as diagnostic tools to inspect the Gibbs sampling process.

There is also a generic plot function that can generate two types of plots; a histogram of the z-scores and a qq-plot. The histogram of the z-scores shows on top the standard normal distribution and the Gibbs Sampling estimated empirical null distribution. The quantile-quantile plot shows the \(-log_{10}\) transformed P-values. Default values are raw, not controlled for bias and inflation, z-scores and P-values.

plot(bc, type="hist")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Histogram of z-scores. With on top standard normal (black) and estimated empirical null distribution (red).

Figure 4: Histogram of z-scores
With on top standard normal (black) and estimated empirical null distribution (red).

plot(bc, type="qq")
Quantile-quantile plot of $-log_{10}$ transformed P-values. Left panel using uncorrected P-values and right panel using bacon bias and inflation corrected P-values.

Figure 5: Quantile-quantile plot of \(-log_{10}\) transformed P-values
Left panel using uncorrected P-values and right panel using bacon bias and inflation corrected P-values.

3 Multiple sets of test-statistics

Matrices containing \(5000\times6\) effect-sizes and standard errors are generated to simulated data for a fixed-effect meta-analyses. This is a toy-example just to illustrate the capabilities of bacon in handling multiple sets of test-statics.

set.seed(12345)
biases <- runif(6, -0.2, 0.2)
inflations <- runif(6, 1, 1.3)
es <- matrix(nrow=5000, ncol=6)
for(i in 1:6)
    es[,i] <- rnormmix(5000, c(0.9, biases[i], inflations[i], 0, 4, 1), shuffle=FALSE)
se <- replicate(6, 0.8*sqrt(4/rchisq(5000,df=4)))
colnames(es) <- colnames(se) <- LETTERS[1:ncol(se)]
rownames(es) <- rownames(se) <- 1:5000
head(rownames(es))
## [1] "1" "2" "3" "4" "5" "6"
head(colnames(es))
## [1] "A" "B" "C" "D" "E" "F"

By default the function bacon detects the number of cores/nodes registered, as described in the BiocParallel, to perform bacon in parallel. To run the vignette in general we set it here for convenience to 1 node.

library(BiocParallel)
register(MulticoreParam(1, log=TRUE))
bc <- bacon(NULL, es, se)
## Did you registered a biocparallel back-end?
##  Continuing serial!
bc
## Bacon-object containing 6 set(s) of 5000 test-statistics.
## ...estimated bias: 0.064,0.093,0.089,0.051,0.018,-0.075.
## ...estimated inflation: 1.2,1.3,1.3,1.3,1.1,1.1.
## 
## Empirical null estimates are based on 5000 iterations with a burnin-period of 2000.
knitr::kable(estimates(bc))
p.0 p.1 p.2 mu.0 mu.1 mu.2 sigma.0 sigma.1 sigma.2
A 0.868 0.072 0.060 0.064 2.64 -2.68 1.19 3.64 3.20
B 0.878 0.070 0.051 0.093 2.84 -2.79 1.30 3.01 3.67
C 0.853 0.079 0.067 0.089 2.66 -2.72 1.30 3.22 3.37
D 0.831 0.059 0.110 0.051 3.00 -1.12 1.33 1.54 4.59
E 0.881 0.059 0.060 0.018 2.69 -2.62 1.15 4.02 3.45
F 0.859 0.061 0.080 -0.075 2.74 -2.60 1.15 3.57 3.25
inflation(bc)
##    A    B    C    D    E    F 
## 1.19 1.30 1.30 1.33 1.15 1.14
bias(bc)
##       A       B       C       D       E       F 
##  0.0643  0.0935  0.0887  0.0505  0.0175 -0.0747
knitr::kable(tstat(bc)[1:5,])
A B C D E F
-0.669 0.607 -0.613 -0.723 0.183 -0.991
0.360 0.260 0.242 -3.219 -0.783 2.524
-0.488 -0.036 -0.134 -0.805 0.794 -0.274
0.116 -2.718 -0.911 -1.588 0.462 0.296
0.569 0.907 1.924 0.844 2.026 -1.196
knitr::kable(pval(bc)[1:5,])
A B C D E F
0.503 0.544 0.540 0.469 0.855 0.322
0.719 0.795 0.808 0.001 0.433 0.012
0.626 0.971 0.893 0.421 0.427 0.784
0.908 0.007 0.362 0.112 0.644 0.768
0.569 0.364 0.054 0.399 0.043 0.232
knitr::kable(se(bc)[1:5,])
A B C D E F
1.057 0.919 1.889 1.806 1.255 0.896
0.856 1.666 1.948 1.042 0.898 0.799
1.342 1.451 0.881 1.277 1.036 1.078
2.066 1.239 1.705 0.757 0.799 2.260
2.525 1.182 0.681 0.736 0.825 0.979
knitr::kable(es(bc)[1:5,])
A B C D E F
-0.708 0.558 -1.158 -1.306 0.229 -0.888
0.308 0.433 0.472 -3.354 -0.703 2.017
-0.655 -0.053 -0.118 -1.028 0.822 -0.295
0.240 -3.368 -1.553 -1.203 0.369 0.668
1.437 1.072 1.309 0.621 1.671 -1.171

The accessor-function return as expected matrices of estimates. For the plotting functions an additional index of the ith study or z-score is required.

traces(bc, burnin=FALSE, index=3)
Plot of Gibbs Sampling traces. Each panel represent of one the estimated parameters. Default plot shows the burin-in period as well.

Figure 6: Plot of Gibbs Sampling traces
Each panel represent of one the estimated parameters. Default plot shows the burin-in period as well.

posteriors(bc, index=3)
Gibbs Sampling posterior distributions of two estimated parameters the inflation (sigma 0) and proportion of null features (pi0 0). Posterior plots of the other parameters can be generated by using the `thetas` argument. The ellipical curves corresponding to a 75%, 90% and 95% probability regions for a bivariate normal distribution with mean and covariance estimated form the scatter-plot.

Figure 7: Gibbs Sampling posterior distributions of two estimated parameters the inflation (sigma 0) and proportion of null features (pi0 0)
Posterior plots of the other parameters can be generated by using the thetas argument. The ellipical curves corresponding to a 75%, 90% and 95% probability regions for a bivariate normal distribution with mean and covariance estimated form the scatter-plot.

fit(bc, n=100, index=3)
Fit to the data as estimated using the Gibbs Sampling algorithm. Black line represent to overall fit, red the fit of the null distribution and blue and green the alternatives.

Figure 8: Fit to the data as estimated using the Gibbs Sampling algorithm
Black line represent to overall fit, red the fit of the null distribution and blue and green the alternatives.

plot(bc, type="hist")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Histogram of z-scores. With on top standard normal (black) and estimated empirical null distribution (red).

Figure 9: Histogram of z-scores
With on top standard normal (black) and estimated empirical null distribution (red).

plot(bc, type="qq")
Quantile-quantile plot of $-log_{10}$ transformed P-values. Left panel using uncorrected P-values and right panel using bacon bias and inflation corrected P-values.

Figure 10: Quantile-quantile plot of \(-log_{10}\) transformed P-values
Left panel using uncorrected P-values and right panel using bacon bias and inflation corrected P-values.

4 Fixed-effect meta-analysis

The following code chunk shows how to perform fixed-effect meta-analysis and the inspection of results.

bcm <- meta(bc)
head(pval(bcm))
##       A       B      C       D      E      F   meta
## 1 0.503 0.54373 0.5396 0.46946 0.8551 0.3219 0.4372
## 2 0.719 0.79515 0.8084 0.00128 0.4335 0.0116 0.9660
## 3 0.626 0.97091 0.8931 0.42071 0.4272 0.7840 0.7615
## 4 0.908 0.00657 0.3623 0.11222 0.6444 0.7675 0.0621
## 5 0.569 0.36431 0.0544 0.39868 0.0427 0.2316 0.0225
## 6 0.279 0.18887 0.7676 0.56422 0.0229 0.3042 0.0110
print(topTable(bcm))
##      eff.size.meta std.err.meta pval.adj.meta pval.org.meta tstat.meta
## 4976         -5.86        0.359      2.60e-56      5.21e-60      -16.3
## 4617          5.27        0.404      3.20e-35      6.39e-39       13.0
## 4820          4.20        0.322      3.29e-35      6.57e-39       13.0
## 4520          3.89        0.320      3.59e-30      7.17e-34       12.1
## 4804          5.26        0.437      1.16e-29      2.32e-33       12.0
## 4919          4.54        0.378      1.53e-29      3.07e-33       12.0
## 4562          4.57        0.384      4.42e-29      8.84e-33       11.9
## 4918         -4.20        0.366      9.58e-27      1.92e-30      -11.5
## 4585         -3.42        0.312      2.85e-24      5.71e-28      -11.0
## 4567         -4.32        0.394      2.94e-24      5.89e-28      -11.0
##      eff.size.A std.err.A   pval.A tstat.A eff.size.B std.err.B   pval.B
## 4976     -0.657     1.399 6.39e-01 -0.4696     -2.804     1.597 7.90e-02
## 4617      7.637     0.947 7.40e-16  8.0637      1.248     0.977 2.02e-01
## 4820      2.148     0.805 7.62e-03  2.6685     -5.913     0.935 2.53e-10
## 4520      0.648     1.561 6.78e-01  0.4149      0.755     0.700 2.81e-01
## 4804      4.038     1.029 8.70e-05  3.9243     -0.756     2.332 7.46e-01
## 4919      8.105     0.571 9.51e-46 14.1974     -0.606     1.429 6.72e-01
## 4562      2.817     0.848 8.91e-04  3.3228      8.665     0.608 3.99e-46
## 4918     -0.404     1.466 7.83e-01 -0.2755     -7.780     1.047 1.10e-13
## 4585     -2.734     0.966 4.64e-03 -2.8313      1.967     0.880 2.55e-02
## 4567      0.070     1.886 9.70e-01  0.0371     -6.144     0.862 1.04e-12
##      tstat.B eff.size.C std.err.C   pval.C tstat.C eff.size.D std.err.D
## 4976  -1.756    -7.0712     1.336 1.22e-07 -5.2910      2.892     1.016
## 4617   1.277    -2.2205     1.391 1.11e-01 -1.5959     -0.357     1.947
## 4820  -6.325    -9.1407     0.967 3.34e-21 -9.4516      1.070     2.107
## 4520   1.078     1.3994     1.602 3.82e-01  0.8738     -0.501     0.670
## 4804  -0.324     1.7008     1.232 1.68e-01  1.3801      9.846     0.865
## 4919  -0.424     1.8062     0.848 3.32e-02  2.1293     -3.252     1.171
## 4562  14.258     5.7236     2.163 8.13e-03  2.6465      3.657     1.533
## 4918  -7.428    -1.4107     1.803 4.34e-01 -0.7824     -3.416     0.760
## 4585   2.234     5.1212     0.927 3.31e-08  5.5241      1.535     1.244
## 4567  -7.125    -0.0976     1.636 9.52e-01 -0.0596     -5.254     0.703
##        pval.D tstat.D eff.size.E std.err.E    pval.E tstat.E eff.size.F
## 4976 4.40e-03   2.848     -11.24     0.539  1.76e-96  -20.84     -2.260
## 4617 8.54e-01  -0.184       5.17     0.861  1.95e-09    6.00      9.094
## 4820 6.11e-01   0.508      10.65     0.454 1.37e-121   23.45      2.292
## 4520 4.55e-01  -0.748      10.24     0.526  2.24e-84   19.46     -0.351
## 4804 4.75e-30  11.389       5.76     0.840  7.13e-12    6.85      2.036
## 4919 5.50e-03  -2.776       5.30     1.034  2.93e-07    5.13      5.497
## 4562 1.70e-02   2.386      -4.46     1.363  1.06e-03   -3.27      2.189
## 4918 6.96e-06  -4.495      -6.21     0.636  1.59e-22   -9.77     -1.492
## 4585 2.17e-01   1.234      -7.20     0.425  2.29e-64  -16.94     -3.267
## 4567 7.94e-14  -7.471      -9.74     0.841  4.91e-31  -11.58      4.866
##      std.err.F   pval.F tstat.F
## 4976     0.727 1.88e-03  -3.108
## 4617     0.736 4.81e-35  12.351
## 4820     1.101 3.75e-02   2.081
## 4520     0.972 7.18e-01  -0.361
## 4804     1.196 8.88e-02   1.702
## 4919     1.621 6.97e-04   3.391
## 4562     0.812 7.00e-03   2.697
## 4918     0.795 6.06e-02  -1.876
## 4585     1.342 1.49e-02  -2.434
## 4567     0.999 1.12e-06   4.869
plot(bcm, type="qq")
Quantile-quantile plot of $-log_{10}$ transformed P-values for each cohort and the meta-analysis P-values. Left panel using uncorrected P-values and right panel using bacon bias and inflation corrected P-values.

Figure 11: Quantile-quantile plot of \(-log_{10}\) transformed P-values for each cohort and the meta-analysis P-values
Left panel using uncorrected P-values and right panel using bacon bias and inflation corrected P-values.

5 Session Info

Here is the output of sessionInfo() on the system on which this document was compiled:

## R version 3.4.2 (2017-09-28)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
## 
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.6-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.6-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] bacon_1.6.0         ellipse_0.3-8       BiocParallel_1.12.0
## [4] ggplot2_2.2.1       BiocStyle_2.6.0    
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.13     knitr_1.17       magrittr_1.5     munsell_0.4.3   
##  [5] colorspace_1.3-2 rlang_0.1.2      highr_0.6        stringr_1.2.0   
##  [9] plyr_1.8.4       tools_3.4.2      parallel_3.4.2   grid_3.4.2      
## [13] gtable_0.2.0     htmltools_0.3.6  yaml_2.1.14      lazyeval_0.2.1  
## [17] rprojroot_1.2    digest_0.6.12    tibble_1.3.4     bookdown_0.5    
## [21] evaluate_0.10.1  rmarkdown_1.6    labeling_0.3     stringi_1.1.5   
## [25] compiler_3.4.2   scales_0.5.0     backports_1.1.1

References

Iterson, M. van, E. W. van Zwet, and B. T. Heijmans. 2017. “Controlling bias and inflation in epigenome- and transcriptome-wide association studies using the empirical null distribution.” Genome Biol. 18 (1): 19.