1 Installation

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("roastgsa")

2 Using gene set collections for battery testing

2.1 Gene set collections

Several gene set databases are publicly available and can be used for gene set analysis when screening bulk data for hypothesis generation. The Hallmark gene set database [1] provides a well-curated list of gene sets representing biological states, which can be used to obtain an overall characterization of the data. Other Human Molecular Signatures Database (MSigDB) collections, including gene ontology pathways (Biological Process, Cellular Component, Molecular Function), immunologic signature gene sets and regulatory target gene sets, can be used in roastgsa analyses to identify the most relevant expression changes between experimental conditions for highly specific biological functions. These collections and sub-collections can be downloaded from https://www.gsea-msigdb.org/gsea/msigdb/collections.jsp. Other public collections suitable for battery testing, such as the KEGG pathway database (https://www.genome.jp/kegg/pathway.html) [2] or Reactome (https://reactome.org/) [3], provide a broad representation of gene sets for human diseases, metabolism and cellular processes, among others.

These gene set collections can be supplied to roastgsa either by (1) loading the gene sets into R as a list object, in which each element contains the gene identifiers of one gene set, or (2) saving a .gmt file with the whole collection in a specific folder, for example:

# DO NOT RUN
# gspath = "path/to/folder/of/h.all.v7.2.symbols.gmt"
# gsetsel = "h.all.v7.2.symbols.gmt"
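
The two options above can be illustrated with base R alone. The sketch below is only an illustration under stated assumptions: the gene set names, gene symbols and file name are invented, and the file is written to a temporary folder. It assumes the usual .gmt layout (set name, description field, then gene identifiers, all tab-separated) and that gspath holds the folder while gsetsel holds the file name.

```r
# Option 1: toy gene sets held in a named list (names and genes are made up)
gsets <- list(
  TOY_SET_A = c("LGR5", "MKI67", "MYC"),
  TOY_SET_B = c("KRT20", "EMP1")
)

# Option 2: write the same sets to a .gmt file, one set per line:
# set name, description field, then the gene identifiers (tab-separated)
gmt_file <- file.path(tempdir(), "toy.symbols.gmt")
gmt_lines <- vapply(names(gsets), function(nm) {
  paste(c(nm, "NA", gsets[[nm]]), collapse = "\t")
}, character(1))
writeLines(gmt_lines, gmt_file)

# Read the file back into a named list to check the round trip
fields <- strsplit(readLines(gmt_file), "\t", fixed = TRUE)
gsets_read <- lapply(fields, function(x) x[-(1:2)])
names(gsets_read) <- vapply(fields, function(x) x[1], character(1))

# Folder and file name, split as in the gspath / gsetsel pair above
gspath <- dirname(gmt_file)
gsetsel <- basename(gmt_file)
```

Reading the file back is a quick way to confirm that the identifiers in the collection are of the same type (here, gene symbols) as those used for the expression data.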

2.2 Usage of Hallmarks in another example with real data

We download publicly available arrays from GEO (accession ‘GSE145603’), which contain LGR5 and POLI double tumor cell populations in colorectal cancer [4].

# DO NOT RUN
# library(GEOquery)
# data <- getGEO('GSE145603')
# normdata <- data[[1]]
# pd <- pData(normdata)
# pd$group_LGR5 <- pd[["lgr5:ch1"]]

We are interested in screening the hallmarks with the largest changes between LGR5-negative and LGR5-positive samples. The hallmark gene sets are stored in gspath under the file name specified in gsetsel (see Section 2.1).

The formula, design matrix and the corresponding contrast can be obtained as follows:

# DO NOT RUN
# form <- "~ -1 + group_LGR5"
# design <- model.matrix(as.formula(form),pd)
# cont.mat <- data.frame(Lgr5.high_Lgr5.neg = c(1,-1))
# rownames(cont.mat) <- colnames(design)
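
To see what such a contrast encodes, the following toy sketch builds a no-intercept design for a two-level factor and applies a (1, -1) contrast to the fitted group means. The phenotype table, group labels and expression values are invented for illustration and are not the GEO data.

```r
# Toy phenotype table; the two group labels are placeholders
pd_toy <- data.frame(
  group_LGR5 = factor(c("high", "high", "high", "neg", "neg", "neg"))
)

# No-intercept model: one design column per group, in factor level order
design_toy <- model.matrix(~ -1 + group_LGR5, data = pd_toy)
colnames(design_toy)

# Contrast "high minus neg", with rows matched to the design columns
cont_toy <- data.frame(high_vs_neg = c(1, -1))
rownames(cont_toy) <- colnames(design_toy)

# Toy expression values for one gene; the coefficients are the group means
y <- c(5, 6, 7, 1, 2, 3)
group_means <- lm.fit(design_toy, y)$coefficients

# The contrast applied to the group means is the difference between groups
diff_est <- sum(cont_toy$high_vs_neg * group_means)
diff_est
```

Because the design columns follow the order of the factor levels, it is worth checking colnames(design) before writing the contrast: swapping the two levels flips the sign of the estimated effect.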

In microarrays, the expression of each gene can be measured by several probesets. To perform gene set analysis, we select the probeset with maximum variability for every gene:

# DO NOT RUN
# mads <- apply(exprs(normdata), 1, mad)
# gu <- strsplit(fData(normdata)[["Gene Symbol"]], split=' \\/\\/\\/ ')
# names(gu) <- rownames(fData(normdata))
# gu <- gu[sapply(gu, length)==1]
# gu <- gu[gu!='' & !is.na(gu) & gu!='---']
# ps <- rep(names(gu), sapply(gu, length))
# gs <- unlist(gu)
# pss <- tapply(ps, gs, function(o) names(which.max(mads[o])))
# psgen.mvar <- pss
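
The same selection rule can be checked on a small example. In this sketch the expression matrix, probeset identifiers and gene symbols are all invented; only the logic (median absolute deviation per probeset, then the most variable probeset per gene) mirrors the steps above.

```r
# Toy expression matrix: 5 probesets (rows) by 6 samples (columns)
expr_toy <- rbind(
  ps1 = c(1, 2, 3, 4, 5, 6),
  ps2 = c(1, 5, 9, 13, 17, 21),
  ps3 = c(2, 2, 2, 3, 3, 3),
  ps4 = c(0, 10, 20, 30, 40, 50),
  ps5 = c(4, 4, 5, 5, 6, 6)
)

# Toy probeset-to-gene annotation (one symbol per probeset)
symbols <- c(ps1 = "GENEA", ps2 = "GENEA", ps3 = "GENEB",
             ps4 = "GENEB", ps5 = "GENEC")

# Variability of each probeset, measured by the median absolute deviation
mads <- apply(expr_toy, 1, mad)

# For each gene, keep the probeset with the largest MAD
best <- tapply(names(symbols), symbols,
               function(p) names(which.max(mads[p])))

# Gene-level matrix: one row per gene, taken from the selected probeset
expr_gene <- expr_toy[as.vector(best), , drop = FALSE]
rownames(expr_gene) <- names(best)
```

Here ps2 is kept for GENEA and ps4 for GENEB because their values are more spread out than those of ps1 and ps3, while GENEC keeps its only probeset, ps5.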

The roastgsa function is used for competitive gene set analysis testing:

# DO NOT RUN
# roast1 <- roastgsa(exprs(normdata), form = form, covar = pd,
#                    psel = psgen.mvar, contrast = cont.mat[, 1],
#                    gspath = gspath, gsetsel = gsetsel, nrot = 1000,
#                    mccores = 7, set.statistic = "maxmean")
#
# print(roast1)
#
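
For intuition about the set.statistic = "maxmean" option, the sketch below computes a maxmean-type summary of gene-level statistics for a single gene set. It assumes the commonly used definition (the larger of the mean positive part and the mean negative part, returned with its sign). It is not the roastgsa implementation, which may differ in details such as scaling, and the function name and z values are hypothetical.

```r
# Hypothetical helper, not part of roastgsa: maxmean-type set statistic.
# Assumed definition: compare the mean of the positive parts with the mean
# of the negative parts over all genes in the set and return the larger one,
# signed by its direction.
maxmean <- function(z) {
  pos <- mean(pmax(z, 0))   # mean positive part (zeros for negative genes)
  neg <- mean(pmax(-z, 0))  # mean negative part (zeros for positive genes)
  if (pos >= neg) pos else -neg
}

# Toy gene-level statistics (e.g. moderated t-statistics) for one gene set
z_toy <- c(2, 1, -0.5, 0, 3)

up_score <- maxmean(z_toy)     # set dominated by positive statistics
down_score <- maxmean(-z_toy)  # same set with all signs reversed
```

With these toy values the positive part dominates, so the score is positive and equal to the mean positive part. Unlike a plain mean, a statistic of this kind is not cancelled out when a set contains a few strong genes in one direction and many weak genes in the other.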

Summary tables can be presented following the roastgsa::htmlrgsa documentation.

3 sessionInfo

sessionInfo()
## R Under development (unstable) (2024-10-21 r87258)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] DESeq2_1.47.0               SummarizedExperiment_1.37.0
##  [3] Biobase_2.67.0              MatrixGenerics_1.19.0      
##  [5] matrixStats_1.4.1           GenomicRanges_1.59.0       
##  [7] GenomeInfoDb_1.43.0         IRanges_2.41.0             
##  [9] S4Vectors_0.45.0            BiocGenerics_0.53.0        
## [11] roastgsa_1.5.0              knitr_1.48                 
## [13] BiocStyle_2.35.0           
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6            xfun_0.48               bslib_0.8.0            
##  [4] ggplot2_3.5.1           caTools_1.18.3          lattice_0.22-6         
##  [7] vctrs_0.6.5             tools_4.5.0             bitops_1.0-9           
## [10] generics_0.1.3          parallel_4.5.0          tibble_3.2.1           
## [13] fansi_1.0.6             highr_0.11              pkgconfig_2.0.3        
## [16] Matrix_1.7-1            KernSmooth_2.23-24      RColorBrewer_1.1-3     
## [19] lifecycle_1.0.4         GenomeInfoDbData_1.2.13 farver_2.1.2           
## [22] compiler_4.5.0          gplots_3.2.0            tinytex_0.53           
## [25] statmod_1.5.0           munsell_0.5.1           codetools_0.2-20       
## [28] htmltools_0.5.8.1       sass_0.4.9              yaml_2.3.10            
## [31] pillar_1.9.0            crayon_1.5.3            jquerylib_0.1.4        
## [34] BiocParallel_1.41.0     cachem_1.1.0            DelayedArray_0.33.1    
## [37] limma_3.63.0            magick_2.8.5            abind_1.4-8            
## [40] gtools_3.9.5            locfit_1.5-9.10         tidyselect_1.2.1       
## [43] digest_0.6.37           dplyr_1.1.4             bookdown_0.41          
## [46] labeling_0.4.3          fastmap_1.2.0           grid_4.5.0             
## [49] colorspace_2.1-1        cli_3.6.3               SparseArray_1.7.0      
## [52] magrittr_2.0.3          S4Arrays_1.7.1          utf8_1.2.4             
## [55] withr_3.0.2             scales_1.3.0            UCSC.utils_1.3.0       
## [58] rmarkdown_2.28          XVector_0.47.0          httr_1.4.7             
## [61] evaluate_1.0.1          rlang_1.1.4             Rcpp_1.0.13            
## [64] glue_1.8.0              BiocManager_1.30.25     jsonlite_1.8.9         
## [67] R6_2.5.1                zlibbioc_1.53.0

4 References

[1] A. Liberzon, C. Birger, H. Thorvaldsdottir, M. Ghandi, J. P. Mesirov, and P. Tamayo. The Molecular Signatures Database Hallmark Gene Set Collection. Cell Systems, 1(6):417-425, 2015.

[2] M. Kanehisa et al. KEGG as a reference resource for gene and protein annotation. Nucleic Acids Research, 44(D1):D457–D462, 2016. https://doi.org/10.1093/nar/gkv1070

[3] M. Gillespie et al. The reactome pathway knowledgebase 2022. Nucleic Acids Research, 2021, gkab1028. https://doi.org/10.1093/nar/gkab1028

[4] Morral C, Stanisavljevic J, Hernando-Momblona X, et al. Zonation of Ribosomal DNA Transcription Defines a Stem Cell Hierarchy in Colorectal Cancer. Cell Stem Cell. 2020;26(6):845-861.e12. doi:10.1016/j.stem.2020.04.012