In this example, we will generate a gene module-immune cell signature network using the GmicR pacakge. This package uses WGCNA to compresses high-dimensional expression data into module eigenegenes, which are used with bayesian learning and xCell cell signatures to infer causal relationships between gene modules and cell signatures. Expression data must be normalized (RPKM/FPKM/TPM/RSEM) and annotated with official gene symbols.
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install('GmicR')
For macosx users experiencing WGCNA installation errors, try downloading a compiled version from:
For macosx users experiencing installation errors, try downloading OS X binaries from:
For this example, we are downloading microarray expression data provided by the xCell web portal. This dataset contains the expression profiles of twelve different types of human leukocytes from peripheral blood and bone marrow, before and after different treatments.
Detailed information about this dataset is available:
WGCNA is used to for quality control of genes via the goodSamplesGenes function
library(WGCNA)
gsg = goodSamplesGenes(datExpr0, verbose = 3) # columns must be genes
#> Flagging genes and samples with too many missing values...
#> ..step 1
gsg$allOK
#> [1] TRUE
A sampleTree can be used to check for outlier samples. For this example all samples are kept.
For cell signature detection using xCell, the expression data can be written to a csv file. The file can be uploaded at: http://xcell.ucsf.edu
Exps_for_xCell_analysis<-data.frame(t(datExpr), check.names = FALSE)
write.csv(Exps_for_xCell_analysis, file = "Exps_for_xCell_analysis.csv")
Once you have the xCell data processed by http://xcell.ucsf.edu, you will receive an email linking to three text files. Download these files. For GmicR, the “xCell results” file is required for Step 3.
For simplicity, we will carryout WGCNA using 1000 randomly selected genes from 50 randomly selected samples
Auto_WGCNA is a wrapper for WGCNA. Not all options are avaible. For more advanced features please use WGCNA.
library(GmicR)
GMIC_Builder<-Auto_WGCNA(sample_dat,
mergeCutHeight = 0.35, minModuleSize = 10,
deepSplit = 4, networkType = "signed hybrid", TOMType = "unsigned",
corFnc = "bicor", sft_RsquaredCut = 0.85,
reassignThreshold = 1e-06, maxBlockSize = 25000)
#> verify that colnames contain official gene symbols
#> Table for modules and gene counts
Viewing input parameters
GMIC_Builder$Input_Parameters
#> $networkType
#> [1] "signed hybrid"
#>
#> $TOMType
#> [1] "unsigned"
#>
#> $corFnc
#> [1] "bicor"
#>
#> $sft_RsquaredCut
#> [1] 0.85
#>
#> $softPower
#> [1] 5
#>
#> $minModuleSize
#> [1] 10
#>
#> $deepSplit
#> [1] 4
#>
#> $mergeCutHeight
#> [1] 0.35
#>
#> $reassignThreshold
#> [1] 1e-06
#>
#> $maxBlockSize
#> [1] 25000
Soft threshold plot
module clustering
GMIC_Builder$Output_plots$module_clustering
#> Warning in restoreRecordedPlot(x, reloadPkgs): snapshot recorded in different R
#> version (3.5.3)
#> Warning in doTryCatch(return(expr), name, parentenv, handler): snapshot recorded
#> with different graphics engine version (12 - this is version 14)
dendrogram
WGCNA functions intramodularConnectivity and chooseOneHubInEachModule are used to build a dataframe with gene module information.
# Module hubs and Gene influence
GMIC_Builder<-Query_Prep(GMIC_Builder,
calculate_intramodularConnectivity= TRUE,
Find_hubs = TRUE)
#> Allowing multi-threading with up to 72 threads.
head(GMIC_Builder$Query)
#> ref_genes modules hub Freq kTotal kWithin
#> SLC37A4 SLC37A4 0 SLC37A4 100 1.6525368 1.0000000
#> PEX10 PEX10 0 SLC37A4 100 1.6403205 0.8690595
#> GALNT12 GALNT12 0 SLC37A4 100 0.9325671 0.8025879
#> FGF3 FGF3 0 SLC37A4 100 2.2483526 0.7881512
#> ANGEL1 ANGEL1 0 SLC37A4 100 3.4791659 0.7681177
#> PER3 PER3 0 SLC37A4 100 0.8164831 0.7152039
This function constructs a library for gene ontology enrichment, which will be used for module naming with the GO_Module_NameR function.
GMIC_Builder<-GSEAGO_Builder(GMIC_Builder,
species = "Homo sapiens", ontology = "BP", no_cores = 1)
#>
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#>
#> IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#>
#> Filter, Find, Map, Position, Reduce, anyDuplicated, append,
#> as.data.frame, basename, cbind, colnames, dirname, do.call,
#> duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
#> lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
#> pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table,
#> tapply, union, unique, unsplit, which.max, which.min
#> Welcome to Bioconductor
#>
#> Vignettes contain introductory material; view with
#> 'browseVignettes()'. To cite Bioconductor, see
#> 'citation("Biobase")', and for packages 'citation("pkgname")'.
#>
#> Attaching package: 'S4Vectors'
#> The following objects are masked from 'package:base':
#>
#> I, expand.grid, unname
GO_Module_NameR will assign a name to each module based on ontology size. A smaller cut off size will generate a more specific term.
This table provides a summary of detected modules. “Freq” indicates the total genes within each module
head(GMIC_Builder$GO_table, n = 4)
#> GO_module_name modules hub Freq
#> ME0 module 0 0 SLC37A4 100
#> ME1 system process 1 SGCD 319
#> ME2 RNA processing 2 ANP32B 187
#> ME3 cellular carbohydrate catabolic process 3 CYP27A1 88
A searchable dataframe is also generated
head(GMIC_Builder$GO_Query, n = 4)
#> modules ref_genes hub Freq kTotal kWithin GO_module_name
#> SLC37A4 0 SLC37A4 SLC37A4 100 1.6525368 1.0000000 module 0
#> PEX10 0 PEX10 SLC37A4 100 1.6403205 0.8690595 module 0
#> GALNT12 0 GALNT12 SLC37A4 100 0.9325671 0.8025879 module 0
#> FGF3 0 FGF3 SLC37A4 100 2.2483526 0.7881512 module 0
For this example, we are using cell signatures provided by the GmicR package, which were generated using the xCell web portal.
This function merges module eigengenes with xCell signatures prior to discretization. Only xCell signatures are supported. Discretization is carried out with bnlearning using “hartemink” method. For detailed information discretization see: http://www.bnlearn.com/documentation/man/preprocessing.html
GMIC_Builder_disc<-Data_Prep(GMIC_Builder,
xCell_Signatures = file_dir,
ibreaks=10, Remove_ME0 = TRUE)
#> ImmuneScore, StromaScore, and MicroenvironmentScore removed
head(GMIC_Builder_disc$disc_data[sample(seq(1,64),4)])
#> CD4_Tcm CD8_Tcells Pericytes Memory_Bcells
#> GSM565337 H H L M
#> GSM565326 M H L M
#> GSM565270 M H L M
#> GSM565341 H H L M
#> GSM565276 H H L M
#> GSM565358 H L L M
Although the default score for this function is Bayesian Dirichlet equivalent score (bde), for this example we will use the Bayesian Dirichlet sparse score (bds). For sparse data, such as the data used in this example, the bds score is better suited: https://arxiv.org/abs/1605.03884.
Once complete, the GMIC network can be viewed using the Gmic_viz shiny app.
GMIC_Final_dir<-system.file("extdata", "GMIC_Final.Rdata",
package = "GmicR", mustWork = TRUE)
load(GMIC_Final_dir)
if(interactive()){
Gmic_viz(GMIC_Final)
}
You can view the entire network or just a subset of nodes. Inverse relationships can be highlighted based on color and/or edge pattern.
Not all nodes are represented:
You can search for your favorite gene or module of interest.
Or view a module summary table