1 Introduction to rrvgo

Gene Ontologies (GO) are often used to guide the interpretation of high-throughput omics experiments, with lists of differentially regulated genes being summarized into sets of genes with a common functional representation. Due to the hierarchical nature of Gene Ontologies, the resulting lists of enriched sets are usually redundant and difficult to interpret.

rrvgo aims at simplifying the redundancy of GO sets by grouping similar terms based on their semantic similarity. It also provides some plots to help with interpreting the summarized terms.

This software is heavily influenced by REVIGO. It mimics a good part of its core functionality, and even some of the outputs are similar. Without aiming to compete, rrvgo tries to offer a programmatic interface using available annotation databases and semantic similarity methods implemented in the Bioconductor project.

2 Using rrvgo

2.1 Getting started

Starting with a list of genes of interest (e.g. coming from a differential expression analysis), apply any method for the identification of enriched GO terms (see GOstats or GSEA).

rrvgo does not work with genes, but with GO terms. The input is a vector of enriched GO terms, along with (recommended, but not mandatory) a vector of scores. If scores are not provided, rrvgo takes the GO term (set) size as a score, thus favoring broader terms.

2.2 Calculating the similarity matrix and reducing GO terms

The first step is to get the similarity matrix between terms. The function calculateSimMatrix takes a list of GO terms for which the semantic similarity is to be calculated, an OrgDb object for an organism, the ontology of interest and the method to calculate the similarity scores.

library(rrvgo)
go_analysis <- read.delim(system.file("extdata/example.txt", package="rrvgo"))
simMatrix <- calculateSimMatrix(go_analysis$ID,
                                orgdb="org.Hs.eg.db",
                                ont="BP",
                                method="Rel")

The semdata parameter (see ?calculateSimMatrix) is not mandatory as it is calculated on demand. If the function needs to run several times with the same organism, it’s advisable to save the GOSemSim::godata(orgdb, ont=ont) object, in order to reuse it between calls and speed up the calculation of the similarity matrix.

From the similarity matrix one can group terms based on similarity. rrvgo provides the reduceSimMatrix function for that. It takes as arguments i) the similarity matrix, ii) an optional named vector of scores associated to each GO term, iii) a similarity threshold used for grouping terms, and iv) an OrgDb object.
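
One way to reuse the semantic data between sessions is to store it on disk the first time it is built. The sketch below is only an illustration: `cached()` is a hypothetical helper that is not part of rrvgo, and the file name is made up. The rrvgo calls are left commented out because they need the Bioconductor annotation packages to be installed.

```r
# Hypothetical helper (not part of rrvgo): build an object once,
# save it as an RDS file, and reload it from disk on later calls.
cached <- function(path, build) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  obj <- build()
  saveRDS(obj, path)
  obj
}

# Intended use with rrvgo (not run here):
# semdata <- cached("semdata_hs_bp.rds",
#                   function() GOSemSim::godata("org.Hs.eg.db", ont = "BP"))
# simMatrix <- calculateSimMatrix(go_analysis$ID,
#                                 orgdb = "org.Hs.eg.db",
#                                 ont = "BP",
#                                 method = "Rel",
#                                 semdata = semdata)
```

The cached object is only valid for the organism and ontology it was built with, so it is safer to encode both in the file name.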

scores <- setNames(-log10(go_analysis$qvalue), go_analysis$ID)
reducedTerms <- reduceSimMatrix(simMatrix,
                                scores,
                                threshold=0.7,
                                orgdb="org.Hs.eg.db")

reduceSimMatrix selects as the group representative the term with the highest score within the group. In case the vector of scores is not available, reduceSimMatrix will get the GO term size from the OrgDb object and use it as the score, thus favoring broader terms. Please note that scores are interpreted in the direction that higher is better; therefore, if you use p-values as scores, apply a minus log transformation (e.g. -log10) to them first.

Higher thresholds force higher similarity between the terms of a group, resulting in more groups containing less similar terms.
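
To make the idea of grouping terms and choosing a representative more concrete, here is a toy sketch in base R. It is not rrvgo's implementation: the term names, similarity values and q-values are invented, and the use of complete-linkage hierarchical clustering cut at a height of `1 - min_similarity` is an assumption made only for this illustration.

```r
# Toy similarity matrix between four made-up terms (1 = identical).
terms <- c("term_a", "term_b", "term_c", "term_d")
sim <- matrix(c(1.0, 0.9, 0.2, 0.1,
                0.9, 1.0, 0.3, 0.2,
                0.2, 0.3, 1.0, 0.8,
                0.1, 0.2, 0.8, 1.0),
              nrow = 4, dimnames = list(terms, terms))

# Made-up q-values, turned into scores where higher is better.
qvalue <- c(term_a = 0.01, term_b = 0.001, term_c = 0.04, term_d = 0.02)
scores <- -log10(qvalue)

# Assumed toy convention: all terms within a group must be at least
# this similar to each other (complete linkage guarantees this).
min_similarity <- 0.7
hc <- hclust(as.dist(1 - sim), method = "complete")
cluster <- cutree(hc, h = 1 - min_similarity)

# Representative of each group: the member with the highest score.
representative <- tapply(names(scores), cluster,
                         function(ids) ids[which.max(scores[ids])])
representative
```

With these numbers the terms fall into two groups, and the term with the smallest q-value in each group becomes its representative.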

2.3 Plotting and interpretation

rrvgo provides several methods for plotting and interpreting the results.

2.3.1 Similarity matrix heatmap

Plot the similarity matrix as a heatmap, with clustering of columns and rows turned on by default (thus arranging similar terms together).

heatmapPlot(simMatrix,
            reducedTerms,
            annotateParent=TRUE,
            annotationLabel="parentTerm",
            fontsize=6)

The function internally uses pheatmap, and further parameters can be passed to this function.

2.3.2 Scatter plot depicting groups and distance between terms

Plot GO terms as scattered points. Distances between points represent the similarity between terms, and the axes are the first 2 components of applying a PCoA to the (dis)similarity matrix. The size of each point represents the provided score or, in its absence, the number of genes the GO term contains.

scatterPlot(simMatrix, reducedTerms)
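
For readers unfamiliar with PCoA, the following base R sketch shows how two coordinates can be obtained from a similarity matrix with classical multidimensional scaling (`stats::cmdscale`). The matrix is a made-up toy example, and converting similarity to distance as `1 - similarity` is an assumption for illustration; scatterPlot may differ in its details.

```r
# Toy similarity matrix between four made-up terms.
terms <- c("term_a", "term_b", "term_c", "term_d")
sim <- matrix(c(1.0, 0.9, 0.2, 0.1,
                0.9, 1.0, 0.3, 0.2,
                0.2, 0.3, 1.0, 0.8,
                0.1, 0.2, 0.8, 1.0),
              nrow = 4, dimnames = list(terms, terms))

# Classical MDS (PCoA) on the dissimilarities, keeping two components.
coords <- cmdscale(as.dist(1 - sim), k = 2)
colnames(coords) <- c("PCoA1", "PCoA2")

# Similar terms end up close to each other in the 2D plane.
plot(coords, pch = 19, xlab = "PCoA1", ylab = "PCoA2")
text(coords, labels = rownames(coords), pos = 3)
```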

2.3.3 Treemap plot

Treemaps are space-filling visualizations of hierarchical structures. The terms are grouped (colored) based on their parent, and the space used by each term is proportional to its score. Treemaps can help with the interpretation of the summarized results and also with comparing different sets of GO terms.

treemapPlot(reducedTerms)

The function internally uses treemap, and further parameters can be passed to this function.

2.3.4 Word cloud

Word clouds are visualizations that reproduce a text while emphasizing the words that appear most frequently in it. They can help to identify processes and functions that occur more commonly in a set of enriched GO terms, as well as to compare different sets.

wordcloudPlot(reducedTerms, min.freq=1, colors="black")

The function internally uses wordcloud, and further parameters can be passed to this function.
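
The word frequencies that drive a word cloud can be inspected directly. The sketch below counts words in a few made-up term descriptions using base R only; the term names and the tiny stop-word list are hypothetical, and the text processing done by wordcloudPlot may well be different.

```r
# Made-up term descriptions, standing in for the names of enriched GO terms.
term_names <- c("regulation of cell cycle",
                "cell cycle arrest",
                "regulation of apoptotic process")

# Split into lowercase words and drop a minimal, illustrative stop-word list.
words <- unlist(strsplit(tolower(term_names), "\\s+"))
words <- words[!words %in% c("of", "the", "to", "in")]

# Word frequencies, most frequent first.
freq <- sort(table(words), decreasing = TRUE)
freq
```

Words with the highest counts are the ones a word cloud would draw largest.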

2.4 Shiny app

To make the software more accessible to a non-technical audience, rrvgo packages a shiny app which can be accessed by calling the shiny_rrvgo() function from the R console.

rrvgo::shiny_rrvgo()

The app offers interactive access to the plots and tables calculated by rrvgo.

3 Currently supported

3.1 Similarity methods

The available similarity measures are those implemented in the GOSemSim package, namely the Resnik, Lin, Relevance, Jiang and Wang methods. See the Semantic Similarity Measurement Based on GO section from the GOSemSim documentation for more details.
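
As a rough orientation, four of these measures are based on the information content (IC) of the two terms and of their most informative common ancestor (MICA). The functions below follow the commonly cited textbook definitions and use made-up annotation probabilities; they are not GOSemSim's code, which may normalize or differ in detail. The Wang method is graph-based and is not covered by this sketch.

```r
# Information content of a term from its annotation probability p(t).
ic <- function(p) -log(p)

# Textbook definitions, given the IC of terms A and B and of their MICA.
sim_resnik <- function(ic_a, ic_b, ic_mica) ic_mica
sim_lin    <- function(ic_a, ic_b, ic_mica) 2 * ic_mica / (ic_a + ic_b)
sim_rel    <- function(ic_a, ic_b, ic_mica) {
  # Lin's measure weighted by 1 - p(MICA)
  sim_lin(ic_a, ic_b, ic_mica) * (1 - exp(-ic_mica))
}
sim_jiang  <- function(ic_a, ic_b, ic_mica) {
  1 - min(1, ic_a + ic_b - 2 * ic_mica)
}

# Made-up probabilities: two specific terms sharing a broader ancestor.
ic_a    <- ic(0.01)
ic_b    <- ic(0.02)
ic_mica <- ic(0.10)

c(Resnik = sim_resnik(ic_a, ic_b, ic_mica),
  Lin    = sim_lin(ic_a, ic_b, ic_mica),
  Rel    = sim_rel(ic_a, ic_b, ic_mica),
  Jiang  = sim_jiang(ic_a, ic_b, ic_mica))
```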

3.2 Organisms

Bioconductor currently provides OrgDb objects for 20 species, available in the following packages:

Package             Organism
org.Ag.eg.db        Anopheles
org.At.tair.db      Arabidopsis
org.Bt.eg.db        Bovine
org.Ce.eg.db        Worm
org.Cf.eg.db        Canine
org.Dm.eg.db        Fly
org.Dr.eg.db        Zebrafish
org.EcK12.eg.db     E coli strain K12
org.EcSakai.eg.db   E coli strain Sakai
org.Gg.eg.db        Chicken
org.Hs.eg.db        Human
org.Mm.eg.db        Mouse
org.Mmu.eg.db       Rhesus
org.Mxanthus.db     Myxococcus xanthus DK 1622
org.Pf.plasmo.db    Malaria
org.Pt.eg.db        Chimp
org.Rn.eg.db        Rat
org.Sc.sgd.db       Yeast
org.Ss.eg.db        Pig
org.Xl.eg.db        Xenopus

If the organism is not supported in Bioconductor, you can still build your own OrgDb object using the AnnotationForge package and generate the data needed for semantic similarity using the GOSemSim package with:

my_new_fancy_orgdb_object <- 'org.Zz.eg.db'
hsGO <- GOSemSim::godata(my_new_fancy_orgdb_object, ont="MF")

3.3 Gene Ontologies

One of Biological Process (BP), Molecular Function (MF) or Cellular Component (CC).

4 Demo data

The demo data is taken as is from the DOSE package, where it was derived from the R package breastCancerMAINZ. It contains 200 samples with breast cancer at different grades (I, II and III). The dataset basically contains log2 ratios of the geometric means of grade III vs. grade I samples (34 vs. 29 respectively).
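
For clarity on what a log2 ratio of geometric means is, the sketch below computes one for a single hypothetical gene. The expression values are invented and the helper `geo_mean()` is defined here for illustration only; it shows that the ratio equals the difference between the mean log2 values of the two groups.

```r
# Geometric mean of positive values.
geo_mean <- function(x) exp(mean(log(x)))

# Invented expression values of one gene in two groups of samples.
grade3 <- c(8, 16, 32)
grade1 <- c(2, 4, 8)

# log2 ratio of the geometric means ...
log2_ratio <- log2(geo_mean(grade3) / geo_mean(grade1))

# ... which is the same as the difference of the mean log2 values.
log2_diff <- mean(log2(grade3)) - mean(log2(grade1))

c(log2_ratio = log2_ratio, log2_diff = log2_diff)
```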

5 Citing rrvgo

Please consider citing rrvgo if used in support of your own research:

citation("rrvgo")
## 
##   Sergi Sayols (2020). rrvgo: a Bioconductor package to reduce and
##   visualize Gene Ontology terms
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {rrvgo: a Bioconductor package to reduce and visualize Gene Ontology terms},
##     author = {Sergi Sayols},
##     year = {2020},
##     url = {https://ssayols.github.io/rrvgo},
##   }

5.1 Reporting problems or bugs

If you run into problems using rrvgo, the Bioconductor Support site is a good first place to ask for help. If you think there is a bug or an unreported feature, you can report it using the rrvgo GitHub site.

5.2 Session info

The following packages and versions were used in the production of this vignette.

## R version 4.1.1 (2021-08-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] rrvgo_1.6.0      knitr_1.36       BiocStyle_2.22.0
## 
## loaded via a namespace (and not attached):
##  [1] Biobase_2.54.0         httr_1.4.2             sass_0.4.0            
##  [4] bit64_4.0.5            jsonlite_1.7.2         bslib_0.3.1           
##  [7] shiny_1.7.1            assertthat_0.2.1       highr_0.9             
## [10] BiocManager_1.30.16    stats4_4.1.1           blob_1.2.2            
## [13] GenomeInfoDbData_1.2.7 slam_0.1-48            yaml_2.2.1            
## [16] ggrepel_0.9.1          pillar_1.6.4           RSQLite_2.2.8         
## [19] glue_1.4.2             digest_0.6.28          RColorBrewer_1.1-2    
## [22] promises_1.2.0.1       XVector_0.34.0         colorspace_2.0-2      
## [25] htmltools_0.5.2        httpuv_1.6.3           tm_0.7-8              
## [28] pkgconfig_2.0.3        pheatmap_1.0.12        magick_2.7.3          
## [31] bookdown_0.24          zlibbioc_1.40.0        purrr_0.3.4           
## [34] xtable_1.8-4           GO.db_3.14.0           scales_1.1.1          
## [37] later_1.3.0            tibble_3.1.5           KEGGREST_1.34.0       
## [40] farver_2.1.0           generics_0.1.1         IRanges_2.28.0        
## [43] ggplot2_3.3.5          ellipsis_0.3.2         cachem_1.0.6          
## [46] BiocGenerics_0.40.0    NLP_0.2-1              magrittr_2.0.1        
## [49] crayon_1.4.1           mime_0.12              memoise_2.0.0         
## [52] evaluate_0.14          fansi_0.5.0            xml2_1.3.2            
## [55] data.table_1.14.2      treemap_2.4-3          tools_4.1.1           
## [58] org.Hs.eg.db_3.14.0    gridBase_0.4-7         lifecycle_1.0.1       
## [61] stringr_1.4.0          S4Vectors_0.32.0       munsell_0.5.0         
## [64] AnnotationDbi_1.56.0   Biostrings_2.62.0      compiler_4.1.1        
## [67] jquerylib_0.1.4        GenomeInfoDb_1.30.0    rlang_0.4.12          
## [70] grid_4.1.1             RCurl_1.98-1.5         igraph_1.2.7          
## [73] labeling_0.4.2         bitops_1.0-7           rmarkdown_2.11        
## [76] gtable_0.3.0           codetools_0.2-18       DBI_1.1.1             
## [79] R6_2.5.1               dplyr_1.0.7            fastmap_1.1.0         
## [82] bit_4.0.4              utf8_1.2.2             GOSemSim_2.20.0       
## [85] stringi_1.7.5          parallel_4.1.1         Rcpp_1.0.7            
## [88] vctrs_0.3.8            png_0.1-7              wordcloud_2.6         
## [91] tidyselect_1.1.1       xfun_0.27