fgsea
is an R-package for fast preranked gene set
enrichment analysis (GSEA). This package allows to quickly and
accurately calculate arbitrarily low GSEA P-values for a collection of
gene sets. P-value estimation is based on an adaptive multi-level split
Monte-Carlo scheme. See the preprint for
algorithmic details.
Loading example pathways and gene-level statistics and setting random seed:
Running fgsea:
The resulting table contains enrichment scores and p-values:
## pathway pval padj
## <char> <num> <num>
## 1: 5990979_Cell_Cycle,_Mitotic 6.690481e-27 3.920622e-24
## 2: 5990980_Cell_Cycle 3.312565e-26 9.705816e-24
## 3: 5991851_Mitotic_Prometaphase 8.470173e-19 1.654507e-16
## 4: 5992217_Resolution_of_Sister_Chromatid_Cohesion 2.176649e-18 3.188791e-16
## 5: 5991454_M_Phase 1.873997e-14 2.196325e-12
## 6: 5991599_Separation_of_Sister_Chromatids 8.733223e-14 8.529448e-12
## log2err ES NES size leadingEdge
## <num> <num> <num> <int> <list>
## 1: 1.3422338 0.5594755 2.769070 317 66336,66977,12442,107995,66442,12571,...
## 2: 1.3267161 0.5388497 2.705894 369 66336,66977,12442,107995,66442,19361,...
## 3: 1.1239150 0.7253270 2.972690 82 66336,66977,12442,107995,66442,52276,...
## 4: 1.1053366 0.7347987 2.957518 74 66336,66977,12442,107995,66442,52276,...
## 5: 0.9759947 0.5576247 2.554076 173 66336,66977,12442,107995,66442,52276,...
## 6: 0.9545416 0.6164600 2.670030 116 66336,66977,107995,66442,52276,67629,...
As you can see from the warning, fgsea
has a default
lower bound eps=1e-10
for estimating P-values. If you need
to estimate P-value more accurately, you can set the eps
argument to zero in the fgsea
function.
fgseaRes <- fgsea(pathways = examplePathways,
stats = exampleRanks,
eps = 0.0,
minSize = 15,
maxSize = 500)
head(fgseaRes[order(pval), ])
## pathway pval padj
## <char> <num> <num>
## 1: 5990980_Cell_Cycle 2.535645e-26 1.485888e-23
## 2: 5990979_Cell_Cycle,_Mitotic 9.351994e-26 2.740134e-23
## 3: 5991851_Mitotic_Prometaphase 3.633805e-19 7.098033e-17
## 4: 5992217_Resolution_of_Sister_Chromatid_Cohesion 2.077985e-17 3.044248e-15
## 5: 5991454_M_Phase 2.251818e-14 2.639131e-12
## 6: 5991502_Mitotic_Metaphase_and_Anaphase 3.196758e-14 3.122167e-12
## log2err ES NES size leadingEdge
## <num> <num> <num> <int> <list>
## 1: 1.3344975 0.5388497 2.664606 369 66336,66977,12442,107995,66442,19361,...
## 2: 1.3188888 0.5594755 2.740246 317 66336,66977,12442,107995,66442,12571,...
## 3: 1.1330899 0.7253270 2.926512 82 66336,66977,12442,107995,66442,52276,...
## 4: 1.0768682 0.7347987 2.920436 74 66336,66977,12442,107995,66442,52276,...
## 5: 0.9759947 0.5576247 2.547515 173 66336,66977,12442,107995,66442,52276,...
## 6: 0.9653278 0.6052907 2.639370 123 66336,66977,107995,66442,52276,67629,...
One can make an enrichment plot for a pathway:
plotEnrichment(examplePathways[["5991130_Programmed_Cell_Death"]],
exampleRanks) + labs(title="Programmed Cell Death")
Or make a table plot for a bunch of selected pathways:
topPathwaysUp <- fgseaRes[ES > 0][head(order(pval), n=10), pathway]
topPathwaysDown <- fgseaRes[ES < 0][head(order(pval), n=10), pathway]
topPathways <- c(topPathwaysUp, rev(topPathwaysDown))
plotGseaTable(examplePathways[topPathways], exampleRanks, fgseaRes,
gseaParam=0.5)
From the plot above one can see that there are very similar pathways
in the table (for example
5991502_Mitotic_Metaphase_and_Anaphase
and
5991600_Mitotic_Anaphase
). To select only independent
pathways one can use collapsePathways
function:
collapsedPathways <- collapsePathways(fgseaRes[order(pval)][padj < 0.01],
examplePathways, exampleRanks)
mainPathways <- fgseaRes[pathway %in% collapsedPathways$mainPathways][
order(-NES), pathway]
plotGseaTable(examplePathways[mainPathways], exampleRanks, fgseaRes,
gseaParam = 0.5)
To save the results in a text format data:table::fwrite
function can be used:
To make leading edge more human-readable it can be converted using
mapIdsList
(similar to AnnotationDbi::mapIds
)
function and a corresponding database (here org.Mm.eg.db
for mouse):
Also, fgsea
is parallelized using
BiocParallel
package. By default the first registered
backend returned by bpparam()
is used. To tweak the
parallelization one can either specify BPPARAM
parameter
used for bplapply
of set nproc
parameter,
which is a shorthand for setting
BPPARAM=MulticoreParam(workers = nproc)
.
For convenience there is reactomePathways
function that
obtains pathways from Reactome for given set of genes. Package
reactome.db
is required to be installed.
pathways <- reactomePathways(names(exampleRanks))
fgseaRes <- fgsea(pathways, exampleRanks, maxSize=500)
head(fgseaRes)
## pathway pval
## <char> <num>
## 1: 5-Phosphoribose 1-diphosphate biosynthesis 0.88050314
## 2: A tetrasaccharide linker sequence is required for GAG synthesis 0.49554367
## 3: ABC transporters in lipid homeostasis 0.18160920
## 4: ABC-family proteins mediated transport 0.40845070
## 5: ABO blood group biosynthesis 0.97703549
## 6: ADP signalling through P2Y purinoceptor 1 0.01208293
## padj log2err ES NES size leadingEdge
## <num> <num> <num> <num> <int> <list>
## 1: 0.9496124 0.05367696 0.4267378 0.6746132 2 328099, ....
## 2: 0.7727961 0.07362127 0.3755168 0.9660172 10 14733, 2....
## 3: 0.5089750 0.15631240 -0.4385385 -1.2574076 12 19299, 2....
## 4: 0.7289209 0.07687367 0.2614189 1.0268252 66 17463, 2....
## 5: 0.9907460 0.04870109 0.5120427 0.6902345 1 14344
## 6: 0.1020103 0.38073040 0.6097588 1.7826019 17 14696, 1....
One can also start from .rnk
and .gmt
files
as in original GSEA:
rnk.file <- system.file("extdata", "naive.vs.th1.rnk", package="fgsea")
gmt.file <- system.file("extdata", "mouse.reactome.gmt", package="fgsea")
Loading ranks:
ranks <- read.table(rnk.file,
header=TRUE, colClasses = c("character", "numeric"))
ranks <- setNames(ranks$t, ranks$ID)
str(ranks)
## Named num [1:12000] -63.3 -49.7 -43.6 -41.5 -33.3 ...
## - attr(*, "names")= chr [1:12000] "170942" "109711" "18124" "12775" ...
Loading pathways:
## List of 6
## $ 1221633_Meiotic_Synapsis : chr [1:64] "12189" "13006" "15077" "15078" ...
## $ 1368092_Rora_activates_gene_expression : chr [1:9] "11865" "12753" "12894" "18143" ...
## $ 1368110_Bmal1:Clock,Npas2_activates_circadian_gene_expression : chr [1:16] "11865" "11998" "12753" "12952" ...
## $ 1445146_Translocation_of_Glut4_to_the_Plasma_Membrane : chr [1:55] "11461" "11465" "11651" "11652" ...
## $ 186574_Endocrine-committed_Ngn3+_progenitor_cells : chr [1:4] "18012" "18088" "18506" "53626"
## $ 186589_Late_stage_branching_morphogenesis_pancreatic_bud_precursor_cells: chr [1:4] "11925" "15205" "21410" "246086"
And running fgsea:
## pathway
## <char>
## 1: 1221633_Meiotic_Synapsis
## 2: 1445146_Translocation_of_Glut4_to_the_Plasma_Membrane
## 3: 442533_Transcriptional_Regulation_of_Adipocyte_Differentiation_in_3T3-L1_Pre-adipocytes
## 4: 508751_Circadian_Clock
## 5: 5334727_Mus_musculus_biological_processes
## 6: 573389_NoRC_negatively_regulates_rRNA_expression
## pval padj log2err ES NES size leadingEdge
## <num> <num> <num> <num> <num> <int> <list>
## 1: 0.5483333 0.7251271 0.06523531 0.2885754 0.9412385 27 15270, 1....
## 2: 0.6857610 0.8336764 0.05378728 0.2387284 0.8387184 39 17918, 1....
## 3: 0.1362468 0.3150882 0.19381330 -0.3640706 -1.3431389 31 76199, 1....
## 4: 0.7779661 0.8806609 0.04959020 0.2516324 0.7382131 17 20893, 59027
## 5: 0.3884892 0.5913108 0.07511816 0.2469065 1.0473836 106 60406, 1....
## 6: 0.4101695 0.6090340 0.08085892 0.3607407 1.0583037 17 60406, 20018
fgsea
package also contains a function called
fora
for over-representation analysis based enrichment
tests, based on the hypergeometric test.
fora
requires a foreground set of genes of interest, a
background set consisting of all robustly detected genes (also called
universe), and some pathways.
In the following example, the foreground (fg
) consists
of the 500 genes with the highest stat
value.
fg <- names(head(exampleRanks[order(exampleRanks, decreasing=TRUE)],500))
bg <- names(exampleRanks)
foraRes <- fora(genes=fg, universe=bg, pathways=examplePathways)
head(foraRes)
## pathway pval padj
## <char> <num> <num>
## 1: 5990980_Cell_Cycle 2.498582e-22 3.547986e-19
## 2: 5990979_Cell_Cycle,_Mitotic 6.527769e-20 4.634716e-17
## 3: 5991851_Mitotic_Prometaphase 1.108012e-17 5.244589e-15
## 4: 5992217_Resolution_of_Sister_Chromatid_Cohesion 1.096637e-15 3.893062e-13
## 5: 5991454_M_Phase 8.051967e-13 2.286759e-10
## 6: 5991757_RHO_GTPases_Activate_Formins 4.101960e-12 9.707972e-10
## foldEnrichment overlap size overlapGenes
## <num> <int> <int> <list>
## 1: 4.097561 63 369 11799, 1....
## 2: 4.164038 55 317 11799, 1....
## 3: 7.902439 27 82 11799, 1....
## 4: 7.783784 24 74 11799, 1....
## 5: 4.439306 32 173 11799, 1....
## 6: 6.461538 21 78 11799, 1....
## R Under development (unstable) (2024-10-21 r87258)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] org.Mm.eg.db_3.20.0 AnnotationDbi_1.69.0 IRanges_2.41.2
## [4] S4Vectors_0.45.2 Biobase_2.67.0 BiocGenerics_0.53.3
## [7] generics_0.1.3 BiocParallel_1.41.0 ggplot2_3.5.1
## [10] data.table_1.16.4 fgsea_1.33.2
##
## loaded via a namespace (and not attached):
## [1] KEGGREST_1.47.0 fastmatch_1.1-4 gtable_0.3.6
## [4] xfun_0.49 bslib_0.8.0 lattice_0.22-6
## [7] vctrs_0.6.5 tools_4.5.0 parallel_4.5.0
## [10] tibble_3.2.1 RSQLite_2.3.9 blob_1.2.4
## [13] pkgconfig_2.0.3 Matrix_1.7-1 lifecycle_1.0.4
## [16] GenomeInfoDbData_1.2.13 compiler_4.5.0 farver_2.1.2
## [19] Biostrings_2.75.3 munsell_0.5.1 codetools_0.2-20
## [22] GenomeInfoDb_1.43.2 htmltools_0.5.8.1 sass_0.4.9
## [25] yaml_2.3.10 pillar_1.10.0 crayon_1.5.3
## [28] jquerylib_0.1.4 cachem_1.1.0 tidyselect_1.2.1
## [31] digest_0.6.37 dplyr_1.1.4 labeling_0.4.3
## [34] cowplot_1.1.3 fastmap_1.2.0 grid_4.5.0
## [37] colorspace_2.1-1 cli_3.6.3 magrittr_2.0.3
## [40] reactome.db_1.89.0 withr_3.0.2 scales_1.3.0
## [43] UCSC.utils_1.3.0 bit64_4.5.2 rmarkdown_2.29
## [46] XVector_0.47.0 httr_1.4.7 bit_4.5.0.1
## [49] png_0.1-8 memoise_2.0.1 evaluate_1.0.1
## [52] knitr_1.49 rlang_1.1.4 Rcpp_1.0.13-1
## [55] glue_1.8.0 DBI_1.2.3 jsonlite_1.8.9
## [58] R6_2.5.1 zlibbioc_1.53.0