netDx 1.6.0
In this example we will build a predictor to classify breast tumours as being either of Luminal A subtype or otherwise. The process is identical for classifying three or more labels, and the example uses minimal data for quick runtime.
Although the netDx algorithm requires assay data to be provided in the form of a MultiAssayExperiment
object, the package comes equipped with the convertToMAE()
wrapper function to transform raw experimental assay data/tables into a MultiAssayExperiment
object. We will use data from The Cancer Genome Atlas to build the predictor, converting it from a MultiAssayExperiment
object into a list to illustrate how to utilize the convertToMAE()
wrapper function.
We will integrate two types of -omic data:
First, we load the netDx
package.
suppressWarnings(suppressMessages(require(netDx)))
For this example we pull data from the The Cancer Genome Atlas through the BioConductor curatedTCGAData
package.
suppressMessages(library(curatedTCGAData))
We fetch the two layers of data that we need:
brca <- suppressMessages(curatedTCGAData("BRCA",
c("mRNAArray",
"miRNASeqGene"),
dry.run=FALSE, version="1.1.38"))
The fetch command automatically brings in a MultiAssayExperiment
object.
summary(brca)
## Length Class Mode
## 2 MultiAssayExperiment S4
This next code block prepares the TCGA data. In practice you would do this once, and save the data before running netDx, but we run it here to see an end-to-end example.
source("prepare_data.R")
brca <- prepareData(brca,setBinary=TRUE)
## harmonizing input:
## removing 28 sampleMap rows with 'colname' not in colnames of experiments
## harmonizing input:
## removing 44 sampleMap rows with 'colname' not in colnames of experiments
The important thing is to create ID
and STATUS
columns in the sample metadata slot. netDx uses these to get the patient identifiers and labels, respectively.
pID <- colData(brca)$patientID
colData(brca)$ID <- pID
To build the predictor using the netDx algorithm, we call the buildPredictor()
function which takes patient data and variable groupings, and returns a set of patient similarity networks (PSN) as an output. The user can customize what datatypes are used, how they are grouped, and what defines patient similarity for a given datatype. This is done specifically by telling the model how to:
The relevant input parameters are:
groupList
: sets of input data that would correspond to individual networks (e.g. genes grouped into pathways)sims
: a list specifying similarity metrics for each data layergroupList
: Grouping variables to define featuresThe groupList
object tells the predictor how to group units when constructing a network. For examples, genes may be grouped into a network representing a pathway. This object is a list; the names match those of dataList
while each value is itself a list and reflects a potential network.
In this simple example we just create a single PSN for each datatype (mRNA gene expression, and miRNA expression data), containing all measures from that datatype, where measures can be individual genes, proteins, CpG bases (in DNA methylation data), clinical variables, etc.,
expr <- assays(brca)
groupList <- list()
for (k in 1:length(expr)) { # loop over all layers
cur <- expr[[k]]; nm <- names(expr)[k]
# all measures from this layer go into our single PSN
groupList[[nm]] <- list(nm=rownames(cur))
# assign same layer name as in input data
names(groupList[[nm]])[1] <- nm;
}
sims
: Define patient similarity for each networkWhat is this: sims
is used to define similarity metrics for each layer.
This is done by providing a single list - here, sims
- that specifies the choice of similarity metric to use for each data layer. The names()
for this list must match those in groupList
. The corresponding value can either be a character if specifying a built-in similarity function, or a function. The latter is used if the user wishes to specify a custom similarity function.
The current available options for built-in similarity measures are:
pearsonCorr
: Pearson correlation (n>5 measures in set)normDiff
: normalized difference (single measure such as age)avgNormDiff
: average normalized difference (small number of measures)sim.pearscale
: Pearson correlation followed by exponential scalingsim.eucscale
: Euclidean distance followed by exponential scalingIn this example, we choose Pearson correlation similarity for all data layers.
sims <- list(a="pearsonCorr", b="pearsonCorr")
names(sims) <- names(groupList)
Data pulled from The Cancer Genome Atlas through the BioConductor curatedTCGAData
package automatically fetches data in the form of a MultiAssayExperiment
object. However, most workflows that might utilize the netDx algorithm will have experimental assay data and patient metadata in the form of data frames/matrices/tables.
To facilitate ease-of-use, the netDx package has a built-in wrapper function convertToMAE()
that takes in an input list of key-value pairs of experimental assay data and patient metadata, converting it into a MultiAssayExperiment
object compatible with further analysis using the netDx algorithm. However, all relevant data engineering/preparation should be done before using the convertToMAE()
wrapper function.
This next code block converts the TCGA data into a list format to illustrate how one might use the convertToMAE()
wrapper function.
brcaData <- dataList2List(brca, groupList)
The keys of the input list of key-value pairs should be labelled according to the type of data corresponding to the value pairs (methylation, mRNA, proteomic, etc) and there must be a key-value pair that corresponds to patient IDs/metadata labelled pheno
.
brcaList <- brcaData$assays
brcaList <- c(brcaList, list(brcaData$pheno))
names(brcaList)[3] <- "pheno"
We can now call the convertToMAE()
wrapper function to convert the list containing experimental assay data and patient metadata into a MultiAssayExperiment
object.
brca <- convertToMAE(brcaList)
We can then proceed with the rest of the netDx workflow.
Now we’re ready to train our model. netDx uses parallel processing to speed up compute time. Let’s use 75% available cores on the machine for this example. netDx also throws an error if provided an output directory that already has content, so let’s clean that up as well.
nco <- round(parallel::detectCores()*0.75) # use 75% available cores
message(sprintf("Using %i of %i cores", nco, parallel::detectCores()))
## Using 54 of 72 cores
outDir <- paste(tempdir(),"pred_output",sep=getFileSep()) # use absolute path
if (file.exists(outDir)) unlink(outDir,recursive=TRUE)
numSplits <- 2L
Finally we call the function that runs the netDx predictor. We provide:
dataList
)groupList
)sims
)numSplits
),featScoreMax
, set to 10)featSelCutoff
); only features scoring this value or higher will be used to classify test patients,numCores
).The call below runs two train/test splits. Within each split, it:
trainProp=0.8
)featScoreMax=2L
)featSelCutoff=1L
) to classify test samples for that split.In practice a good starting point is featScoreMax=10
, featSelCutoff=9
and numSplits=10L
, but these parameters depend on the sample sizes in the dataset and heterogeneity of the samples.
t0 <- Sys.time()
set.seed(42) # make results reproducible
model <- suppressMessages(
buildPredictor(
dataList=brca, ## your data
groupList=groupList, ## grouping strategy
sims = sims,
outDir=outDir, ## output directory
trainProp=0.8, ## pct of samples to use to train model in each split
numSplits=2L, ## number of train/test splits
featSelCutoff=1L, ## threshold for calling something feature-selected
featScoreMax=2L, ## max score for feature selection
numCores=nco, ## set higher for parallelizing
debugMode=FALSE,
keepAllData=FALSE, ## set to TRUE for debugging or low-level files used by the dictor
logging="none"
))
t1 <- Sys.time()
print(t1-t0)
## Time difference of 12.57687 mins
Now we get model output, including performance for various train/test splits and consistently high-scoring features.
In the function below, we define top-scoring features as those which score two out of two in at least half of the train/test splits:
results <- getResults(model,unique(colData(brca)$STATUS),
featureSelCutoff=2L,featureSelPct=0.50)
## Detected 2 splits and 2 classes
## * Plotting performance
## * Compiling feature scores and calling selected features
results
contains performance
, selectedFeatures
for each patient label, and the table of feature scores
.
summary(results)
## Length Class Mode
## selectedFeatures 2 -none- list
## featureScores 2 -none- list
## performance 4 -none- list
Look at the performance:
results$performance
## $meanAccuracy
## [1] 0.9015152
##
## $splitAccuracy
## [1] 0.9393939 0.8636364
##
## $splitAUROC
## [1] 0.9956522 0.9728261
##
## $splitAUPR
## [1] 0.9763790 0.9661563
Look at feature scores for all labels, across all train-test splits:
results$featureScores
## $Luminal.A
## Feature Split1 Split2
## 1 BRCA_mRNAArray-20160128 2 2
## 2 BRCA_miRNASeqGene-20160128 1 1
##
## $other
## Feature Split1 Split2
## 1 BRCA_mRNAArray-20160128 2 2
## 2 BRCA_miRNASeqGene-20160128 1 2
Let’s examine our confusion matrix:
confMat <- confusionMatrix(model)
Note: Rows of this matrix don’t add up to 100% because the matrix is an average of the confusion matrices from all of the train/test splits.
And here are selected features, which are those scoring 2 out of 2 in at least half of the splits. This threshold is simply for illustration. In practice we would run at least 10 train/test splits (ideally 100+), and look for features that score 7+ out of 10 in >70% splits.
results$selectedFeatures
## $Luminal.A
## [1] "BRCA_mRNAArray-20160128"
##
## $other
## [1] "BRCA_mRNAArray-20160128" "BRCA_miRNASeqGene-20160128"
We finally get the integrated PSN and visualize it using a tSNE plot:
## this call doesn't work in Rstudio; for now we've commented this out and saved the PSN file.
psn <- getPSN(brca,groupList,sims = sims,selectedFeatures=results$selectedFeatures)
## BRCA_miRNASeqGene-20160128: 1 features
## BRCA_mRNAArray-20160128: 1 features
## * Making integrated PSN
## Making nets from sims
## BRCA_miRNASeqGene-20160128
## Layer BRCA_miRNASeqGene-20160128: PEARSON CORR
## Pearson similarity chosen - enforcing min. 5 patients per net.
## BRCA_mRNAArray-20160128
## Layer BRCA_mRNAArray-20160128: PEARSON CORR
## Pearson similarity chosen - enforcing min. 5 patients per net.
## Net construction complete!
## Warning in dir.create(paste(netDir, "profiles", sep = fsep)): '/tmp/Rtmpz1rz5l/
## profiles' already exists
## * Computing aggregate net
##
## 91467 pairs have no edges (counts directed edges)
## Sparsity = 14809/52975 (28 %)
##
## * Prune network
## * plotCytoscape is set to FALSE.Set to TRUE to visualize patient network in Cytoscape
require(Rtsne)
## Loading required package: Rtsne
tsne <- tSNEPlotter(
psn$patientSimNetwork_unpruned,
colData(brca)
)
## * Making symmetric matrix
## * Running tSNE
## * Plotting
sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] Rtsne_0.15 curatedTCGAData_1.15.1
## [3] MultiAssayExperiment_1.20.0 SummarizedExperiment_1.24.0
## [5] Biobase_2.54.0 GenomicRanges_1.46.0
## [7] GenomeInfoDb_1.30.0 IRanges_2.28.0
## [9] S4Vectors_0.32.0 BiocGenerics_0.40.0
## [11] MatrixGenerics_1.6.0 matrixStats_0.61.0
## [13] netDx_1.6.0 BiocStyle_2.22.0
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-7 bit64_4.0.5
## [3] filelock_1.0.2 doParallel_1.0.16
## [5] RColorBrewer_1.1-2 httr_1.4.2
## [7] tools_4.1.1 bslib_0.3.1
## [9] utf8_1.2.2 R6_2.5.1
## [11] DBI_1.1.1 colorspace_2.0-2
## [13] withr_2.4.2 tidyselect_1.1.1
## [15] bit_4.0.4 curl_4.3.2
## [17] compiler_4.1.1 glmnet_4.1-2
## [19] DelayedArray_0.20.0 labeling_0.4.2
## [21] bookdown_0.24 sass_0.4.0
## [23] scales_1.1.1 rappdirs_0.3.3
## [25] stringr_1.4.0 digest_0.6.28
## [27] rmarkdown_2.11 XVector_0.34.0
## [29] pkgconfig_2.0.3 htmltools_0.5.2
## [31] plotrix_3.8-2 highr_0.9
## [33] dbplyr_2.1.1 fastmap_1.1.0
## [35] rlang_0.4.12 RSQLite_2.2.8
## [37] shiny_1.7.1 farver_2.1.0
## [39] shape_1.4.6 jquerylib_0.1.4
## [41] generics_0.1.1 combinat_0.0-8
## [43] jsonlite_1.7.2 dplyr_1.0.7
## [45] RCurl_1.98-1.5 magrittr_2.0.1
## [47] GenomeInfoDbData_1.2.7 Matrix_1.3-4
## [49] Rcpp_1.0.7 munsell_0.5.0
## [51] fansi_0.5.0 lifecycle_1.0.1
## [53] stringi_1.7.5 yaml_2.2.1
## [55] zlibbioc_1.40.0 AnnotationHub_3.2.0
## [57] plyr_1.8.6 BiocFileCache_2.2.0
## [59] grid_4.1.1 blob_1.2.2
## [61] promises_1.2.0.1 parallel_4.1.1
## [63] ExperimentHub_2.2.0 bigmemory.sri_0.1.3
## [65] crayon_1.4.1 lattice_0.20-45
## [67] Biostrings_2.62.0 splines_4.1.1
## [69] KEGGREST_1.34.0 knitr_1.36
## [71] pillar_1.6.4 igraph_1.2.7
## [73] reshape2_1.4.4 codetools_0.2-18
## [75] BiocVersion_3.14.0 glue_1.4.2
## [77] evaluate_0.14 BiocManager_1.30.16
## [79] png_0.1-7 httpuv_1.6.3
## [81] vctrs_0.3.8 foreach_1.5.1
## [83] gtable_0.3.0 purrr_0.3.4
## [85] assertthat_0.2.1 cachem_1.0.6
## [87] ggplot2_3.3.5 xfun_0.27
## [89] mime_0.12 xtable_1.8-4
## [91] pracma_2.3.3 later_1.3.0
## [93] survival_3.2-13 tibble_3.1.5
## [95] iterators_1.0.13 AnnotationDbi_1.56.0
## [97] memoise_2.0.0 interactiveDisplayBase_1.32.0
## [99] bigmemory_4.5.36 ellipsis_0.3.2
## [101] ROCR_1.0-11