JASPAR (http://jaspar.genereg.net) is an open-access database of curated, non-redundant transcription factor (TF)-binding profiles stored as position frequency matrices (PFMs) for TFs across multiple species in six taxonomic groups. In this 8th release of JASPAR, the CORE collection has been expanded with 245 new PFMs (169 for vertebrates, 42 for plants, 17 for nematodes, 10 for insects, and 7 for fungi), and 156 PFMs were updated (125 for vertebrates, 28 for plants, and 3 for insects). These new profiles represent an 18% expansion compared to the previous release. JASPAR 2020 comes with a novel collection of unvalidated TF-binding profiles for which our curators did not find orthogonal supporting evidence in the literature. This collection has a dedicated web form to engage the community in the curation of unvalidated TF-binding profiles.
The easiest way to use the JASPAR2020 data package (Fornes et al. 2019) is through the TFBSTools package interface (Tan and Lenhard 2016), which provides functions to retrieve and manipulate data from the JASPAR database. This vignette demonstrates how to use those functions.
library(JASPAR2020)
library(TFBSTools)
Matrices from JASPAR can be retrieved using either the getMatrixByID or the getMatrixByName function by providing a matrix ID or a matrix name from JASPAR, respectively. These functions accept either a single element as the ID/name parameter or a vector of values. The former case returns a PFMatrix object, while the latter returns a PFMatrixList containing multiple PFMatrix objects.
## the user assigns a single matrix ID to the argument ID
pfm <- getMatrixByID(JASPAR2020, ID = "MA0139.1")
## the function returns a PFMatrix object
pfm
#> An object of class PFMatrix
#> ID: MA0139.1
#> Name: CTCF
#> Matrix Class: C2H2 zinc finger factors
#> strand: +
#> Tags:
#> $alias
#> [1] "-"
#>
#> $description
#> [1] "CCCTC-binding factor (zinc finger protein)"
#>
#> $family
#> [1] "More than 3 adjacent zinc finger factors"
#>
#> $medline
#> [1] "17512414"
#>
#> $remap_tf_name
#> [1] "CTCF"
#>
#> $symbol
#> [1] "CTCF"
#>
#> $tax_group
#> [1] "vertebrates"
#>
#> $tfbs_shape_id
#> [1] "133"
#>
#> $type
#> [1] "ChIP-seq"
#>
#> $unibind
#> [1] "1"
#>
#> $collection
#> [1] "CORE"
#>
#> $species
#> 9606
#> "Homo sapiens"
#>
#> $acc
#> [1] "P49711"
#>
#> Background:
#> A C G T
#> 0.25 0.25 0.25 0.25
#> Matrix:
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
#> A 87 167 281 56 8 744 40 107 851 5 333 54 12 56
#> C 291 145 49 800 903 13 528 433 11 0 3 12 0 8
#> G 76 414 449 21 0 65 334 48 32 903 566 504 890 775
#> T 459 187 134 36 2 91 11 324 18 3 9 341 8 71
#> [,15] [,16] [,17] [,18] [,19]
#> A 104 372 82 117 402
#> C 733 13 482 322 181
#> G 5 507 307 73 266
#> T 67 17 37 396 59
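The fields shown in the printed object can also be read programmatically. The following is a minimal sketch, assuming the accessor functions ID, name, Matrix, and tags exported by TFBSTools; these calls are not part of the original example, and the variable names are illustrative.

```r
library(JASPAR2020)
library(TFBSTools)

pfm <- getMatrixByID(JASPAR2020, ID = "MA0139.1")

## identifiers stored in the object
matrix_id   <- ID(pfm)
matrix_name <- name(pfm)

## the raw count matrix: one row per nucleotide, one column per position
counts <- Matrix(pfm)
dim(counts)

## annotation tags are returned as a named list
tags(pfm)$collection
```

Working on the plain count matrix returned by Matrix is convenient when the counts need to be passed to code that does not know about the PFMatrix class.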
The user can use the PFMatrix object for further analysis and visualisation. Here is an example of how to plot a sequence logo of a given matrix using functions available in the TFBSTools package.
seqLogo(toICM(pfm))
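The call above first converts the frequency matrix into an information content matrix. As a hedged sketch of the related conversions, the example below keeps that intermediate object and also derives a position weight matrix with toPWM from TFBSTools. The pseudocount of 0.8 is an assumption made for this illustration, not a value recommended by the vignette.

```r
library(JASPAR2020)
library(TFBSTools)

pfm <- getMatrixByID(JASPAR2020, ID = "MA0139.1")

## information content matrix, the input expected by seqLogo()
icm <- toICM(pfm)

## log2 probability-ratio weights; 0.8 is the pseudocount chosen for this sketch
pwm <- toPWM(pfm, type = "log2probratio", pseudocounts = 0.8)
pwm
```

Both conversions keep the ID, name, and tags of the original profile, so the resulting objects can still be traced back to their JASPAR entry.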
## the user assigns multiple matrix IDs to the argument ID
pfmList <- getMatrixByID(JASPAR2020, ID=c("MA0139.1", "MA1102.1"))
## the function returns a PFMatrixList object
pfmList
#> PFMatrixList of length 2
#> names(2): MA0139.1 MA1102.1
## PFMatrixList can be subsetted to extract enclosed PFMatrix objects
pfmList[[2]]
#> An object of class PFMatrix
#> ID: MA1102.1
#> Name: CTCFL
#> Matrix Class: C2H2 zinc finger factors
#> strand: +
#> Tags:
#> $centrality_logp
#> [1] "-7211.51"
#>
#> $family
#> [1] "More than 3 adjacent zinc finger factors"
#>
#> $medline
#> [1] "26268681"
#>
#> $remap_tf_name
#> [1] "CTCFL"
#>
#> $source
#> [1] "29126285"
#>
#> $tax_group
#> [1] "vertebrates"
#>
#> $tfbs_shape_id
#> [1] "1"
#>
#> $type
#> [1] "ChIP-seq"
#>
#> $unibind
#> [1] "1"
#>
#> $collection
#> [1] "CORE"
#>
#> $species
#> 9606
#> "Homo sapiens"
#>
#> $acc
#> [1] "Q8NI51"
#>
#> Background:
#> A C G T
#> 0.25 0.25 0.25 0.25
#> Matrix:
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
#> A 627 3344 909 499 8422 76 568 266 123 127 117 4022 714 1329
#> C 6915 1487 4650 6703 166 93 145 325 179 136 8625 730 3772 3392
#> G 860 2520 2898 535 152 8596 8035 7716 8447 8380 37 3702 3448 1810
#> T 423 1474 368 1088 85 60 77 518 76 182 46 371 891 2294
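Because each element of a PFMatrixList is an ordinary PFMatrix, the list can be summarised by visiting its elements one at a time. The sketch below is one possible way to do this, assuming the name and Matrix accessors from TFBSTools; the summary table and its column names are illustrative only.

```r
library(JASPAR2020)
library(TFBSTools)

ids <- c("MA0139.1", "MA1102.1")
pfmList <- getMatrixByID(JASPAR2020, ID = ids)

## walk over the list by position and collect the TF name and motif width
idx <- seq_len(length(pfmList))
tf_names <- vapply(idx, function(i) name(pfmList[[i]]), character(1))
widths   <- vapply(idx, function(i) ncol(Matrix(pfmList[[i]])), integer(1))

data.frame(ID = names(pfmList), name = tf_names, width = widths)
```

Explicit indexing with [[ is used here because it relies only on the subsetting behaviour already shown above.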
getMatrixByName retrieves matrices by name. If multiple matrix names are supplied, the function returns a PFMatrixList object.
pfm <- getMatrixByName(JASPAR2020, name="Arnt")
pfm
#> An object of class PFMatrix
#> ID: MA0004.1
#> Name: Arnt
#> Matrix Class: Basic helix-loop-helix factors (bHLH)
#> strand: +
#> Tags:
#> $alias
#> [1] "HIF-1beta,bHLHe2"
#>
#> $description
#> [1] "aryl hydrocarbon receptor nuclear translocator"
#>
#> $family
#> [1] "PAS domain factors"
#>
#> $medline
#> [1] "7592839"
#>
#> $remap_tf_name
#> [1] "ARNT"
#>
#> $symbol
#> [1] "ARNT"
#>
#> $tax_group
#> [1] "vertebrates"
#>
#> $tfbs_shape_id
#> [1] "11"
#>
#> $type
#> [1] "SELEX"
#>
#> $unibind
#> [1] "1"
#>
#> $collection
#> [1] "CORE"
#>
#> $species
#> 10090
#> "Mus musculus"
#>
#> $acc
#> [1] "P53762"
#>
#> Background:
#> A C G T
#> 0.25 0.25 0.25 0.25
#> Matrix:
#> [,1] [,2] [,3] [,4] [,5] [,6]
#> A 4 19 0 0 0 0
#> C 16 0 20 0 0 0
#> G 0 1 0 20 0 20
#> T 0 0 0 0 20 0
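To make the meaning of the counts concrete, the Arnt matrix printed above is small enough to work through by hand. The following base R sketch copies those counts and derives position probabilities, a consensus sequence, and a simple per-position information content. It is a simplified illustration that uses a uniform background and no pseudocounts, so its values are not guaranteed to match what toICM in TFBSTools computes.

```r
## counts copied from the Arnt (MA0004.1) matrix printed above
counts <- rbind(
  A = c(4, 19, 0, 0, 0, 0),
  C = c(16, 0, 20, 0, 0, 0),
  G = c(0, 1, 0, 20, 0, 20),
  T = c(0, 0, 0, 0, 20, 0)
)

## position probability matrix: divide each column by its total count
ppm <- sweep(counts, 2, colSums(counts), "/")

## most frequent nucleotide at each position
consensus <- paste(rownames(counts)[apply(counts, 2, which.max)], collapse = "")

## information content per position in bits
## (0 * log2(0) is treated as 0; no pseudocounts are added)
entropy <- -colSums(ifelse(ppm > 0, ppm * log2(ppm), 0))
ic <- 2 - entropy

consensus
round(ic, 2)
```

The consensus string is CACGTG. Positions where all 20 sequences agree carry the maximum of 2 bits, while the more variable first position carries less.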
pfmList <- getMatrixByName(JASPAR2020, name=c("Arnt", "Ahr::Arnt"))
pfmList
#> PFMatrixList of length 2
#> names(2): Arnt Ahr::Arnt
The getMatrixSet function fetches all matrices that match criteria defined by the named arguments, and it returns a PFMatrixList object.
## select all matrices found in a specific species and constructed from the SELEX
## experiment
opts <- list()
opts[["species"]] <- 9606
opts[["type"]] <- "SELEX"
opts[["all_versions"]] <- TRUE
PFMatrixList <- getMatrixSet(JASPAR2020, opts)
PFMatrixList
#> PFMatrixList of length 48
#> names(48): MA0002.1 MA0003.1 MA0018.1 MA0025.1 ... MA0124.1 MA0130.1 MA0131.1
## retrieve all matrices constructed from SELEX experiment
opts2 <- list()
opts2[["type"]] <- "SELEX"
PFMatrixList2 <- getMatrixSet(JASPAR2020, opts2)
PFMatrixList2
#> PFMatrixList of length 82
#> names(82): MA0004.1 MA0006.1 MA0015.1 MA0016.1 ... MA0588.1 MA0589.1 MA0590.1
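Other criteria can be combined in the same way. The sketch below assumes that getMatrixSet also accepts the option keys "collection" and "tax_group", which are not used in the examples above; the object names opts3 and vertebrateCore are illustrative. No expected count is shown because it depends on the database contents.

```r
library(JASPAR2020)
library(TFBSTools)

## CORE vertebrate profiles, latest version of each matrix only
opts3 <- list()
opts3[["collection"]] <- "CORE"
opts3[["tax_group"]] <- "vertebrates"
opts3[["all_versions"]] <- FALSE
vertebrateCore <- getMatrixSet(JASPAR2020, opts3)
length(vertebrateCore)
```

Setting all_versions to FALSE keeps only the most recent version of each profile, which avoids counting the same TF several times.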
Additional details about TFBS matrix analysis can be found in the TFBSTools documentation.
Here is the output of sessionInfo() on the system on which this document was compiled:
#> R version 4.0.0 alpha (2020-04-07 r78171)
#> Platform: x86_64-apple-darwin17.7.0 (64-bit)
#> Running under: macOS High Sierra 10.13.6
#>
#> Matrix products: default
#> BLAS: /Users/ka36530_ca/R-stuff/bin/R-4-0/lib/libRblas.dylib
#> LAPACK: /Users/ka36530_ca/R-stuff/bin/R-4-0/lib/libRlapack.dylib
#>
#> locale:
#> [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] TFBSTools_1.27.0 JASPAR2020_0.99.10 BiocStyle_2.17.0
#>
#> loaded via a namespace (and not attached):
#> [1] httr_1.4.1 Biobase_2.49.0
#> [3] bit64_0.9-7 R.utils_2.9.2
#> [5] gtools_3.8.2 BiocManager_1.30.10
#> [7] stats4_4.0.0 blob_1.2.1
#> [9] BSgenome_1.57.0 GenomeInfoDbData_1.2.3
#> [11] Rsamtools_2.5.1 yaml_2.2.1
#> [13] DirichletMultinomial_1.31.0 pillar_1.4.4
#> [15] RSQLite_2.2.0 lattice_0.20-41
#> [17] glue_1.4.1 digest_0.6.25
#> [19] GenomicRanges_1.41.1 XVector_0.29.1
#> [21] colorspace_1.4-1 R.oo_1.23.0
#> [23] htmltools_0.4.0 Matrix_1.2-18
#> [25] plyr_1.8.6 XML_3.99-0.3
#> [27] pkgconfig_2.0.3 bookdown_0.19
#> [29] zlibbioc_1.35.0 GO.db_3.11.4
#> [31] xtable_1.8-4 purrr_0.3.4
#> [33] scales_1.1.1 pracma_2.2.9
#> [35] BiocParallel_1.23.0 annotate_1.67.0
#> [37] tibble_3.0.1 KEGGREST_1.29.0
#> [39] generics_0.0.2 IRanges_2.23.6
#> [41] ggplot2_3.3.1 ellipsis_0.3.1
#> [43] SummarizedExperiment_1.19.4 TFMPvalue_0.0.8
#> [45] BiocGenerics_0.35.2 magrittr_1.5
#> [47] crayon_1.3.4 poweRlaw_0.70.6
#> [49] memoise_1.1.0 evaluate_0.14
#> [51] R.methodsS3_1.8.0 CNEr_1.25.0
#> [53] tools_4.0.0 hms_0.5.3
#> [55] formatR_1.7 lifecycle_0.2.0
#> [57] matrixStats_0.56.0 stringr_1.4.0
#> [59] S4Vectors_0.27.9 munsell_0.5.0
#> [61] DelayedArray_0.15.1 AnnotationDbi_1.51.0
#> [63] Biostrings_2.57.1 compiler_4.0.0
#> [65] GenomeInfoDb_1.25.0 caTools_1.18.0
#> [67] rlang_0.4.6 grid_4.0.0
#> [69] RCurl_1.98-1.2 bitops_1.0-6
#> [71] rmarkdown_2.2 gtable_0.3.0
#> [73] DBI_1.1.0 reshape2_1.4.4
#> [75] R6_2.4.1 GenomicAlignments_1.25.1
#> [77] knitr_1.28 dplyr_1.0.0
#> [79] seqLogo_1.55.4 rtracklayer_1.49.2
#> [81] bit_1.1-15.2 readr_1.3.1
#> [83] stringi_1.4.6 parallel_4.0.0
#> [85] Rcpp_1.0.4.6 png_0.1-7
#> [87] vctrs_0.3.0 tidyselect_1.1.0
#> [89] xfun_0.14
Fornes, Oriol, Jaime A. Castro-Mondragon, Aziz Khan, Robin van der Lee, Xi Zhang, Richmond Phillip, Bhavi P. Modi, et al. 2019. “JASPAR 2020: update of the open-access database of transcription factor binding profiles.” Nucleic Acids Research. https://doi.org/10.1093/nar/gkz1001.
Tan, Ge, and Boris Lenhard. 2016. “TFBSTools: An R/Bioconductor Package for Transcription Factor Binding Site Analysis.” Bioinformatics 32: 1555–6.