This vignette describes how to use spatzie to identify pairs of transcription factors whose sequence motifs (that describe their binding sites) are co-enriched in enhancers and promoters that interact with each other. ChIA-PET (Fullwood and Ruan 2009), HiChIP (Mumbach et al. 2016) or Hi-C(Lieberman-Aiden et al. 2009) are molecular biology assays commonly used to investigate long-range genomic interactions and the data they generate, once properly processed (BEDPE format), serves as input to spatzie co-enrichment analyses. In this case we used interactions data in BEDPE format based on a ChIA-PET assay targeting the transcription factor YY1.
Interactions data in BEDPE format is a tab-separated file, where each line describes one interaction between two anchors, i.e., two regions of the genome that are potentially far away from each other.
In order to find motif pairs co-enriched in enhancers and promoters that interact with each other, we first need to annotate all interaction anchors and discard interactions that are not between enhancers and promoters.
The interactions data was aligned to the mouse genome assembly
mm9 (i.e., the coordinates in the BEDPE file are mm9
genome coordinates). The annotation package
BSgenome.Mmusculus.UCSC.mm9
contains the genomic sequence,
which we need in order to detect transcription factor motifs in the
interaction anchor regions.
TxDb.Mmusculus.UCSC.mm9.knownGene
contains promoter
annotations.
genome_id <- "BSgenome.Mmusculus.UCSC.mm9"
if (!(genome_id %in% rownames(utils::installed.packages()))) {
BiocManager::install(genome_id, update = FALSE, ask = FALSE)
}
genome <- BSgenome::getBSgenome(genome_id)
txdb_id <- "TxDb.Mmusculus.UCSC.mm9.knownGene"
if (!(txdb_id %in% rownames(utils::installed.packages()))) {
BiocManager::install(txdb_id, update = FALSE, ask = FALSE)
}
txdb <- TxDb.Mmusculus.UCSC.mm9.knownGene::TxDb.Mmusculus.UCSC.mm9.knownGene
ensembl_data_set <- "mmusculus_gene_ensembl"
gene_symbol <- "mgi_symbol"
For this vignette we load toy BEDPE example data from a ChIA-PET experiment in murine embryonic stem cells.
yy1_interactions_file <- system.file("extdata/yy1_interactions.bedpe.gz",
package = "spatzie")
yy1_interactions <- GenomicInteractions::makeGenomicInteractionsFromFile(
yy1_interactions_file,
type = "bedpe",
experiment_name = "yy1",
description = "mESC yy1 chr1")
length(yy1_interactions)
## [1] 40000
Unless the interactions data was preprocessed in such a way to include only enhancers in one anchor and only promoters in the other (or any other meaningful grouping), it is imperative to filter out non-enhancer-promoter interactions. To do this, we annotate interaction anchors as either close to a promoter or promoter-distal (i.e., putative enhancer).
Here, promoter regions are loaded, including the surrounding 2500 base pairs.
promoter_ranges <- GenomicFeatures::promoters(txdb,
upstream = 2500,
downstream = 2500,
columns = c("tx_name", "gene_id"))
# trims out-of-bound ranges located on non-circular sequences
promoter_ranges <- GenomicRanges::trim(promoter_ranges)
# remove duplicate promoters from transcript isoforms
promoter_ranges <- BiocGenerics::unique(promoter_ranges)
promoters_df <- as.data.frame(promoter_ranges)
promoters_df$gene_id <- as.character(promoters_df$gene_id)
We use annotateInteractions
from the
GenomicInteractions
package to annotate the interactions
with the promoters obtained in the previous step, thus splitting the
interactions into three groups: (1) interactions between promoter
regions and regions far away from promoters (i.e., putative enhancers),
(2) interactions between two promoter regions, and (3) interactions
between two promoter-distal regions.
As we have seen above, the interactions data contains not only interactions between enhancers and promoters. Here all interactions that are not between enhancers and promoters are discarded and enhancer-promoter interactions are sorted in a way that the promoters are always in anchor 1 and the enhancers are always in anchor 2.
distal_promoter_idx <- GenomicInteractions::isInteractionType(
yy1_interactions, "distal", "promoter")
yy1_pd <- yy1_interactions[distal_promoter_idx]
anchor1 <- GenomicInteractions::anchorOne(yy1_pd)
anchor2 <- GenomicInteractions::anchorTwo(yy1_pd)
promoter_left <- S4Vectors::elementMetadata(anchor1)[, "node.class"] == "promoter"
promoter_right <- S4Vectors::elementMetadata(anchor2)[, "node.class"] == "promoter"
promoter_ranges <- c(anchor1[promoter_left],
anchor2[promoter_right])
enhancer_ranges <- c(anchor2[promoter_left],
anchor1[promoter_right])
yy1_pd <- GenomicInteractions::GenomicInteractions(promoter_ranges,
enhancer_ranges)
For the purpose of this vignette, we use a toy motif database. The
HOCOMOCO motif database (Kulakovskiy et al.
2018) is commonly used, but any motif file compatible with
TFBSTools::readJASPARMatrix()
can be used.
spatzie::scan_motifs()
takes the filtered interactions
data and scans the anchors of each interaction with the motifs of the
provided motif database, usually a set of transcription factor sequence
motifs.
spatzie::filter_motifs()
operates on the object returned
by spatzie::scan_motifs()
and removes motifs that are
present in - in this case - fewer than 40% of the interactions.
spatzie::anchor_pair_enrich()
operates on the object
returned by spatzie::filter_motifs()
and calculates the
co-enrichment of all motif pairs. It supports three methods to determine
co-enrichment, count-based correlation, score-based correlation, and
match-based assocation. For details, see the help page
(?spatzie::anchor_pair_enrich
) or the spatzie paper
(citation("spatzie")
).
yy1_pd_score_corr <- anchor_pair_enrich(yy1_pd_interaction,
method = "score")
pheatmap::pheatmap(-log2(yy1_pd_score_corr$pair_motif_enrich), fontsize = 6)
YY1 binds enhancer and promoter sites providing scaffolding that forms enhancer-promoter interactions in mouse stem cells (Weintraub et al. 2017). As expected, spatzie identified a statistically significant co-occurrence of YY1 motifs indicating this dependency.
When interpreting spatzie results, keep in mind that motif databases such as HOCOMOCO often include groups of transcription factors with highly similar DNA-binding motifs (in this example YY1 and ZF.5), and the putative co-enrichment of one pair of transcription factor binding sites might be explained by another pair with highly similar motifs.
Please note that the motifs and the interactions data used in this vignette are dummy data used for demonstration purposes only.
Most of the functionality of the spatzie package is also offered through the website at https://spatzie.mit.edu.
For a more detailed discussion on spatzie, please have a look at the paper:
spatzie: An R package for identifying significant
transcription factor motif co-enrichment from enhancer-promoter
interactions
Jennifer Hammelman, Konstantin Krismer, and David K. Gifford
Nucleic Acids Research, 2022, gkac036; DOI: https://doi.org/10.1093/nar/gkac036
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19042)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BSgenome.Mmusculus.UCSC.mm9_1.4.0 BSgenome_1.60.0 rtracklayer_1.52.0
## [4] Biostrings_2.60.1 XVector_0.32.0 GenomicRanges_1.44.0
## [7] GenomeInfoDb_1.28.1 IRanges_2.26.0 S4Vectors_0.30.0
## [10] BiocGenerics_0.38.0 spatzie_0.99.6 usethis_2.0.1
##
## loaded via a namespace (and not attached):
## [1] colorspace_2.0-2 rjson_0.2.20 ellipsis_0.3.2
## [4] biovizBase_1.40.0 htmlTable_2.2.1 base64enc_0.1-3
## [7] fs_1.5.0 dichromat_2.0-0 rstudioapi_0.13
## [10] farver_2.1.0 bit64_4.0.5 AnnotationDbi_1.54.1
## [13] fansi_0.5.0 xml2_1.3.2 splines_4.1.0
## [16] R.methodsS3_1.8.1 cachem_1.0.5 knitr_1.33
## [19] Formula_1.2-4 Rsamtools_2.8.0 seqLogo_1.58.0
## [22] annotate_1.70.0 cluster_2.1.2 GO.db_3.13.0
## [25] dbplyr_2.1.1 png_0.1-7 R.oo_1.24.0
## [28] pheatmap_1.0.12 readr_1.4.0 compiler_4.1.0
## [31] httr_1.4.2 backports_1.2.1 lazyeval_0.2.2
## [34] assertthat_0.2.1 Matrix_1.3-4 fastmap_1.1.0
## [37] htmltools_0.5.1.1 prettyunits_1.1.1 tools_4.1.0
## [40] igraph_1.2.6 gtable_0.3.0 glue_1.4.2
## [43] TFMPvalue_0.0.8 GenomeInfoDbData_1.2.6 reshape2_1.4.4
## [46] dplyr_1.0.7 rappdirs_0.3.3 Rcpp_1.0.7
## [49] TxDb.Mmusculus.UCSC.mm9.knownGene_3.2.2 Biobase_2.52.0 vctrs_0.3.8
## [52] xfun_0.24 CNEr_1.28.0 stringr_1.4.0
## [55] lifecycle_1.0.0 ensembldb_2.16.2 restfulr_0.0.13
## [58] poweRlaw_0.70.6 gtools_3.9.2 InteractionSet_1.20.0
## [61] XML_3.99-0.6 zlibbioc_1.38.0 scales_1.1.1
## [64] VariantAnnotation_1.38.0 ProtGenerics_1.24.0 GenomicInteractions_1.26.0
## [67] hms_1.1.0 MatrixGenerics_1.4.0 SummarizedExperiment_1.22.0
## [70] AnnotationFilter_1.16.0 RColorBrewer_1.1-2 yaml_2.2.1
## [73] curl_4.3.2 memoise_2.0.0 gridExtra_2.3
## [76] ggplot2_3.3.5 biomaRt_2.48.2 rpart_4.1-15
## [79] latticeExtra_0.6-29 stringi_1.6.2 RSQLite_2.2.7
## [82] highr_0.9 BiocIO_1.2.0 checkmate_2.0.0
## [85] GenomicFeatures_1.44.0 caTools_1.18.2 filelock_1.0.2
## [88] BiocParallel_1.26.1 rlang_0.4.11 pkgconfig_2.0.3
## [91] matrixStats_0.59.0 bitops_1.0-7 evaluate_0.14
## [94] pracma_2.3.3 lattice_0.20-44 purrr_0.3.4
## [97] labeling_0.4.2 htmlwidgets_1.5.3 GenomicAlignments_1.28.0
## [100] bit_4.0.4 tidyselect_1.1.1 plyr_1.8.6
## [103] magrittr_2.0.1 R6_2.5.0 generics_0.1.0
## [106] Hmisc_4.5-0 DelayedArray_0.18.0 DBI_1.1.1
## [109] pillar_1.6.1 foreign_0.8-81 survival_3.2-11
## [112] KEGGREST_1.32.0 RCurl_1.98-1.3 nnet_7.3-16
## [115] tibble_3.1.2 crayon_1.4.1 utf8_1.2.1
## [118] BiocFileCache_2.0.0 rmarkdown_2.9 jpeg_0.1-8.1
## [121] progress_1.2.2 TFBSTools_1.30.0 grid_4.1.0
## [124] data.table_1.14.0 blob_1.2.1 digest_0.6.27
## [127] xtable_1.8-4 R.utils_2.10.1 munsell_0.5.0
## [130] DirichletMultinomial_1.34.0 motifmatchr_1.14.0 Gviz_1.36.2