EpiCompare
: Getting startedThe EpiCompare package is designed to facilitate the comparison of epigenomic datasets for quality control and benchmarking purposes. The package combines several downstream analysis tools for epigenomic data and generates a single report that collates all results of the analysis. This allows users to conduct downstream analysis of multiple epigenomic datasets simultaneously and compare the results in a simple and efficient way.
For many years, ChIP-seq has been the standard method for epigenomic profiling, but it suffers from a host of limitations. Recently, many other epigenomic technologies (e.g. CUT&Run, CUT&Tag and TIP-seq etc.), designed to overcome these constraints, have been developed. To better understand the performance of these novel approaches, it is important that we systematically compare these technologies and benchmark against a “gold-standard”.
There are many tools in R (e.g. ChIPseeker) that can be used to conduct downstream analysis and comparison of epigenomic datasets. However, these are often scattered across different packages and difficult to use for researchers with none or little computational experience.
EpiCompare is designed to provide a simple and comprehensive way of analysing and comparing epigenomic datasets. It combines many useful downstream analysis tools, which can easily be controlled by users and it collates the results in a single report. This allows researchers to systematically compare different epigenomic technologies.
While the main functionality of EpiCompare is to contrast epigenomic technologies, it can also be used to compare datasets generated using different experimental conditions and data analysis workflows of one technology. This can help researchers to establish a consensus regarding the optimal use of the method.
Currently, EpiCompare only works for human genome as it uses human-based hg19 and/or hg38 genome references.
The EpiCompare package contains a small subset of histone mark H3K27ac profile data obtained/generated from:
It also contains human genome hg19 and hg38 blacklisted regions obtained from ENCODE. The ENCODE blacklist includes regions of the genome that have anomalous and/or unstructured signals independent of the cell-line or experiment. Removal of ENCODE blacklist is recommended for quality measure.
These dataset will be used to showcase EpiCompare functionality
To install the package, run the following:
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("EpiCompare")
In this example analysis, we will compare CUT&Run and CUT&Tag of histone mark H3K27ac against ENCODE ChIP-seq.
Once installed, load the package:
library(EpiCompare)
##
## Warning: replacing previous import 'Biostrings::pattern' by 'grid::pattern'
## when loading 'genomation'
Load example datasets used in this analysis:
data("encode_H3K27ac") # ENCODE ChIP-seq
data("CnT_H3K27ac") # CUT&Tag
data("CnR_H3K27ac") # CUT&Run
data("hg19_blacklist") # hg19 genome blacklist
data("CnT_H3K27ac_picard") # CUT&Tag Picard summary output
data("CnR_H3K27ac_picard") # CUT&Run Picard summary output
EpiCompare accepts datasets both as GRanges
object and as paths to BED
files. Peakfiles (GRanges
or paths) that you would like to analyse must be
listed and named (see below).
# To import BED files as GRanges object
peak_GRanges <-ChIPseeker::readPeakFile("/path/to/peak/file.bed",as = "GRanges")
# EpiCompare also accepts paths (to BED files) as input
peak_path <- "/path/to/BED/file1.bed"
# Create named peak list
peaklist <- list(peak_GRanges, peak_path)
names(peaklist) <- c("sample1", "sample2")
In this example, we will use built-in data, which have been converted into
GRanges
object previously (CnT_H3K27ac
and CnR_H3K27ac
).
peaklist <- list(CnT_H3K27ac, CnR_H3K27ac) # create list of peakfiles
names(peaklist) <- c("CnT", "CnR") # set names
ENCODE blacklist contains regions of the genome that have anomalous and/or unstructured signals independent of the cell-line or experiment. Removal of these regions from peakfiles is recommended for quality measure.
EpiCompare has three built-in blacklist files, hg19, hg38 and mm10, downloaded
from ENCODE. Run ?hg19_blacklist
for more information.
In this example analysis, since all peakfiles (encode_H3K27ac
, CnT_H3K27ac
,
CnR_H3K27ac
) were generated using human genome reference build hg19,
hg19_blacklist
will be used. For hg38, use data(hg38_blacklist)
.
Please ensure that you specify the correct blacklist.
Note that this is OPTIONAL. If you want the report to include metrics on DNA fragments (e.g. mapped fragments and duplication rate), please input summary files from Picard.
Picard MarkDuplicates can be used to mark duplicate reads that are found within the alignment. This
tool outputs a metrics file with the ending .MarkDuplicates.metrics.txt
. To
import this text file into R as data frame, use:
picard <- read.table("/path/to/Picard/.MarkDuplicates.metrics.txt", header = TRUE, fill = TRUE)
In this example. we will use built-in data, which have been converted into data
frame previously (CnT_H3K27ac_picard
and CnR_H3K27ac_picard
). The files
must be listed and named:
# create list of Picard summary
picard <- list(CnT_H3K27ac_picard, CnR_H3K27ac_picard)
names(picard) <- c("CnT", "CnR") # set names
This is OPTIONAL. If reference peak file is provided, stat_plot
and
chromHMM_plot
of overlapping peaks are included in the report (see
Optional plots section below).
Reference file must be listed and named. In this example, we will use built-in
data (encode_H3K27ac
), which has been converted to GRanges
previously:
reference_peak <- list("ENCODE_H3K27ac" = encode_H3K27ac)
When running EpiCompare()
, please ensure that you specify output_dir
. All
outputs (figures and HTML report) will be saved in the specified output_dir
.
Running EpiCompare is done using the function, EpiCompare()
. Users can choose
which analyses to run and include in the report by setting parameters to TRUE
or FALSE
.
EpiCompare(peakfiles = peaklist,
genome_build = "hg19",
blacklist = hg19_blacklist,
picard_files = picard,
reference = reference_peak,
upset_plot = FALSE,
stat_plot = FALSE,
chromHMM_plot = FALSE,
chromHMM_annotation = "K562",
chipseeker_plot = FALSE,
enrichment_plot = FALSE,
tss_plot = FALSE,
interact = FALSE,
save_output = FALSE,
output_filename = "EpiCompare_test",
output_timestamp = FALSE,
output_dir = tempdir())
## NOTE: The following EpiCompare features are NOT being used:
## - upset_plot=
## - stat_plot=
## - chromHMM_plot=
## - chipseeker_plot=
## - enrichment_plot=
## - tss_plot=
## - precision_recall_plot=
## - corr_plot=
## - add_download_button=
## processing file: EpiCompare.Rmd
## output file: EpiCompare.knit.md
## /usr/bin/pandoc +RTS -K512m -RTS EpiCompare.knit.md --to html4 --from markdown+autolink_bare_uris+tex_math_single_backslash --output /tmp/Rtmp6GwVWa/EpiCompare_test.html --lua-filter /home/biocbuild/bbs-3.21-bioc/R/site-library/rmarkdown/rmarkdown/lua/pagebreak.lua --lua-filter /home/biocbuild/bbs-3.21-bioc/R/site-library/rmarkdown/rmarkdown/lua/latex-div.lua --embed-resources --standalone --variable bs3=TRUE --section-divs --table-of-contents --toc-depth 3 --variable toc_float=1 --variable toc_selectors=h1,h2,h3 --variable toc_collapsed=1 --variable toc_smooth_scroll=1 --variable toc_print=1 --template /home/biocbuild/bbs-3.21-bioc/R/site-library/rmarkdown/rmd/h/default.html --no-highlight --variable highlightjs=1 --number-sections --variable theme=bootstrap --css custom.css --mathjax --variable 'mathjax-url=https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML' --include-in-header /tmp/Rtmp6GwVWa/rmarkdown-strb86967bf08a5.html --variable code_folding=hide --variable code_menu=1
##
## Output created: /tmp/Rtmp6GwVWa/EpiCompare_test.html
## [1] "Done in 0.11 min."
## All outputs saved to: /tmp/Rtmp6GwVWa
## [1] "/tmp/Rtmp6GwVWa/EpiCompare_test.html"
By default, these plots will not be included in the report unless set TRUE
.
upset_plot
: Upset plot showing the number of overlapping peaks between
samples. EpiCompare uses UpSetR
package.stat_plot
: A reference
peakfile must be included for this plot. The plot
displays distribution of statistical significance (q-values) of sample peaks
that are overlapping/non-overlapping with the reference
dataset.chromHMM_plot
: ChromHMM
annotation of peaks. If reference
is provided, ChromHMM annotation of
overlapping and non-overlapping peaks with the reference
is also included in
the report.chipseeker_plot
: ChIPseeker
functional annotation of peaks.enrichment_plot
: KEGG pathway and GO enrichment analysis of peaks.tss_plot
: Peak frequency around (+/- 3000bp) transcriptional start site.
Note that it may take awhile to generate this plot for large sample sizes.chromHMM_annotation
: Cell-line annotation for ChromHMM. Default is K562.
Options are:
interact
: By default, all heatmaps (percentage overlap and
ChromHMM heatmaps) in the report will be interactive. If set FALSE, all heatmaps
will be static. N.B. If interact=TRUE
, interactive heatmaps will be saved as
html files, which may take time for larger sample sizes.output_filename
: By default, the report is named EpiCompare.html. You can
specify the filename of the report here.output_timestamp
: By default FALSE. If TRUE, the filename of the report
includes the date.EpiCompare outputs
EpiCompare_file
containing all plots generated by EpiCompare if
save_output = TRUE
.Both outputs are saved in the specified output_dir
.
In the current version, EpiCompare only recognizes certain BED formats. We hope to improve this. Moreover, if there are other downstream analysis tools that may be suitable in EpiCompare, feel free to report this through Github.
An example report comparing ATAC-seq, DNase-seq and ChIP-seq of K562 can be found here.
Code used to generate this:
## Load data
# ATAC-seq data: https://www.encodeproject.org/files/ENCFF558BLC/
atac1_hg38 <- ChIPseeker::readPeakFile("/path/to/bed/ENCFF558BLC.bed", as="GRanges")
# Dnase-seq data: https://www.encodeproject.org/files/ENCFF274YGF/
dna1_hg38 <- ChIPseeker::readPeakFile("/path/to/bed/ENCFF274YGF.bed", as="GRanges")
# Dnase-seq data: https://www.encodeproject.org/files/ENCFF185XRG/
dna2_hg38 <- ChIPseeker::readPeakFile("/path/to/bed/ENCFF185XRG.bed", as="GRanges")
# ChIP-seq data: https://www.encodeproject.org/files/ENCFF038DDS/
chip_hg38 <- ChIPseeker::readPeakFile("/path/to/bed/ENCODE_H3K27ac_hg38_ENCFF038DDS.bed", as="GRanges")
## Peaklist
peaklist <- list("ATAC_ENCFF558BLC" = atac1_hg38_unique,
"Dnase_ENCFF274YGF" = dna1_hg38,
"ChIP_ENCFF038DDS" = chip_hg38)
## Reference
reference <- list("Dnase_ENCFF185XRG_reference"=dna2_hg38)
## Blacklist
data("hg38_blacklist")
## Run Epicompare
EpiCompare(peakfiles = peaklist,
genome_build = "hg38",
genome_build_output = "hg38",
blacklist = hg38_blacklist,
reference = reference,
picard_files = NULL,
upset_plot = T,
stat_plot = T,
save_output = F,
chromHMM_plot = T,
chromHMM_annotation = "K562",
chipseeker_plot = T,
enrichment_plot = T,
tss_plot = T,
precision_recall_plot =T,
corr_plot = T,
interact = T,
output_dir = "/path/")
utils::sessionInfo()
## R Under development (unstable) (2024-10-21 r87258)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] EpiCompare_1.11.0 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] splines_4.5.0
## [2] BiocIO_1.17.0
## [3] bitops_1.0-9
## [4] ggplotify_0.1.2
## [5] filelock_1.0.3
## [6] tibble_3.2.1
## [7] R.oo_1.26.0
## [8] XML_3.99-0.17
## [9] lifecycle_1.0.4
## [10] lattice_0.22-6
## [11] magrittr_2.0.3
## [12] plotly_4.10.4
## [13] sass_0.4.9
## [14] rmarkdown_2.28
## [15] jquerylib_0.1.4
## [16] yaml_2.3.10
## [17] plotrix_3.8-4
## [18] ggtangle_0.0.3
## [19] cowplot_1.1.3
## [20] DBI_1.2.3
## [21] RColorBrewer_1.1-3
## [22] lubridate_1.9.3
## [23] abind_1.4-8
## [24] zlibbioc_1.53.0
## [25] GenomicRanges_1.59.0
## [26] purrr_1.0.2
## [27] R.utils_2.12.3
## [28] BiocGenerics_0.53.0
## [29] RCurl_1.98-1.16
## [30] yulab.utils_0.1.7
## [31] rappdirs_0.3.3
## [32] GenomeInfoDbData_1.2.13
## [33] IRanges_2.41.0
## [34] S4Vectors_0.45.0
## [35] enrichplot_1.27.0
## [36] ggrepel_0.9.6
## [37] tidytree_0.4.6
## [38] ChIPseeker_1.43.0
## [39] codetools_0.2-20
## [40] DelayedArray_0.33.0
## [41] DOSE_4.1.0
## [42] tidyselect_1.2.1
## [43] aplot_0.2.3
## [44] UCSC.utils_1.3.0
## [45] farver_2.1.2
## [46] matrixStats_1.4.1
## [47] stats4_4.5.0
## [48] BiocFileCache_2.15.0
## [49] GenomicAlignments_1.43.0
## [50] jsonlite_1.8.9
## [51] tools_4.5.0
## [52] treeio_1.31.0
## [53] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
## [54] Rcpp_1.0.13
## [55] glue_1.8.0
## [56] SparseArray_1.7.0
## [57] xfun_0.48
## [58] qvalue_2.39.0
## [59] MatrixGenerics_1.19.0
## [60] GenomeInfoDb_1.43.0
## [61] dplyr_1.1.4
## [62] withr_3.0.2
## [63] BiocManager_1.30.25
## [64] fastmap_1.2.0
## [65] boot_1.3-31
## [66] fansi_1.0.6
## [67] caTools_1.18.3
## [68] digest_0.6.37
## [69] timechange_0.3.0
## [70] R6_2.5.1
## [71] mime_0.12
## [72] gridGraphics_0.5-1
## [73] seqPattern_1.39.0
## [74] colorspace_2.1-1
## [75] GO.db_3.20.0
## [76] gtools_3.9.5
## [77] RSQLite_2.3.7
## [78] R.methodsS3_1.8.2
## [79] b64_0.1.3
## [80] utf8_1.2.4
## [81] tidyr_1.3.1
## [82] generics_0.1.3
## [83] data.table_1.16.2
## [84] rtracklayer_1.67.0
## [85] bsplus_0.1.4
## [86] httr_1.4.7
## [87] htmlwidgets_1.6.4
## [88] S4Arrays_1.7.0
## [89] downloadthis_0.4.1
## [90] pkgconfig_2.0.3
## [91] gtable_0.3.6
## [92] blob_1.2.4
## [93] impute_1.81.0
## [94] XVector_0.47.0
## [95] htmltools_0.5.8.1
## [96] bookdown_0.41
## [97] fgsea_1.33.0
## [98] scales_1.3.0
## [99] Biobase_2.67.0
## [100] png_0.1-8
## [101] ggfun_0.1.7
## [102] knitr_1.48
## [103] tzdb_0.4.0
## [104] reshape2_1.4.4
## [105] rjson_0.2.23
## [106] nlme_3.1-166
## [107] curl_5.2.3
## [108] cachem_1.1.0
## [109] stringr_1.5.1
## [110] BiocVersion_3.21.1
## [111] KernSmooth_2.23-24
## [112] parallel_4.5.0
## [113] AnnotationDbi_1.69.0
## [114] restfulr_0.0.15
## [115] pillar_1.9.0
## [116] grid_4.5.0
## [117] vctrs_0.6.5
## [118] gplots_3.2.0
## [119] dbplyr_2.5.0
## [120] evaluate_1.0.1
## [121] magick_2.8.5
## [122] tinytex_0.53
## [123] readr_2.1.5
## [124] GenomicFeatures_1.59.0
## [125] cli_3.6.3
## [126] compiler_4.5.0
## [127] Rsamtools_2.23.0
## [128] rlang_1.1.4
## [129] crayon_1.5.3
## [130] labeling_0.4.3
## [131] plyr_1.8.9
## [132] fs_1.6.4
## [133] stringi_1.8.4
## [134] genomation_1.39.0
## [135] viridisLite_0.4.2
## [136] gridBase_0.4-7
## [137] BiocParallel_1.41.0
## [138] munsell_0.5.1
## [139] Biostrings_2.75.0
## [140] lazyeval_0.2.2
## [141] GOSemSim_2.33.0
## [142] Matrix_1.7-1
## [143] BSgenome_1.75.0
## [144] hms_1.1.3
## [145] patchwork_1.3.0
## [146] bit64_4.5.2
## [147] ggplot2_3.5.1
## [148] KEGGREST_1.47.0
## [149] SummarizedExperiment_1.37.0
## [150] highr_0.11
## [151] AnnotationHub_3.15.0
## [152] igraph_2.1.1
## [153] memoise_2.0.1
## [154] bslib_0.8.0
## [155] ggtree_3.15.0
## [156] fastmatch_1.1-4
## [157] bit_4.5.0
## [158] ape_5.8