If you use the data from this package in published research, please cite:
Tianle Ma, Aidong Zhang, Integrate Multi-omic Data Using Affinity Network Fusion (ANF) for Cancer Patient Clustering, https://arxiv.org/abs/1708.07136
There are three R objects included in this package: Wall
, project_ids
and survival.plot
:
inst/scripts/make-data.R explains in detail how Wall
, project_ids
and survival.plot
was generated from original data downloaded from GDC data portal.
In short, Wall
contains a complex list of precomputed patient affinity (similarity) matrices. In fact, Wall
is a list (five cancer types) of list (six feature normalization types: raw.all
, raw.sel
, log.all
, log.sel
, vst.sel
, normalized
) of list (three feature spaces or views: fpkm
, mirna
, and methy450
) of pre-computed patient affinity matrices. (So in total, there are 90 matrices.) The rownames of each matrix are case IDs (i.e., patient IDs), and the column names of each matrix are the aliquot IDs (i.e., TCGA barcode, which contains the case ID as prefix). Based on these aliquot IDs, users can download the original data about these patients from https://portal.gdc.cancer.gov/repository. The file UUIDs of 10328 files used for deriving Wall
are also provided in inst/extdata/fileUUIDs.csv
project_ids
is a named character vector that maps case_id to TCGA project_id. Because each project_id corresponds to one disease type, project_ids
contains information about patient disease type information. Since our goal is to cluster cancer patients into disease types, project_ids
is used for evaluating clustering results, such as calculating Normalized Mutual Information (NMI) and Adjusted Rand Index (ARI).
surv.plot
is a data.frame containing patient survival data for survival analysis downloaded from https://portal.gdc.cancer.gov/exploration?searchTableTab=genes (overall survival plot data), providing an “indirect” way to evaluate clustering results.
See paper https://arxiv.org/abs/1708.07136 for more explanation.
## Loading required package: BiocGenerics
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
## as.data.frame, basename, cbind, colnames, dirname, do.call,
## duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
## lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
## pmin.int, rank, rbind, rownames, sapply, setdiff, table, tapply,
## union, unique, unsplit, which.max, which.min
## Loading required package: AnnotationHub
## Loading required package: BiocFileCache
## Loading required package: dbplyr
## see ?HarmonizedTCGAData and browseVignettes('HarmonizedTCGAData') for documentation
## loading from cache
## see ?HarmonizedTCGAData and browseVignettes('HarmonizedTCGAData') for documentation
## loading from cache
## see ?HarmonizedTCGAData and browseVignettes('HarmonizedTCGAData') for documentation
## loading from cache
Comprehensive molecular profiling data of dozens of cancer types have been made available by TCGA. We selected cancers from five primary sites (in the following we refer to the names of primary sites as cancer types). Each of these five cancer types has at least two known disease types. For other cancer types, we do not know the groundtruth disease types. That’s why they were not included in the package.
## [1] "adrenal_gland" "lung" "uterus" "kidney"
## [5] "colorectal"
We are trying to cluster patients into groups and identify cancer subtypes (i.e., disease types) using multi-omic data. Here we include three types of data: gene expression, miRNA expression and DNA methylation beta values. We selected 2582 patients who has all these data available and included them in this package.
names(Wall[[1]][[1]]) #Note: "fpkm" refers to gene expression measurement, which can be HTSeq-Counts, transformed HTSeq-Counts (log2 transformation or variance-stabilizing transformation), and FPKM values. Sorry for the confusing name.
## [1] "fpkm" "mirnas" "methy450"
For raw counts data (measuring gene expression and miRNA expression), we can perform feature selection (e.g., differential expression analysis) and feature transformation (e.g., log2 transformation and variance-stabilizing transformation). We have included six feature types in this package:
raw.all
: Raw counts of all genes or miRNAs
raw.sel
: Raw counts of selected (differentially expressed) genes or miRNAs (Differential expression analysis was performed using DESeq2)
log.all
: Log transformation of raw counts of all genes or miRNAs
log.sel
: Log transformation of raw counts of selected (differentially expressed) genes or miRNAs
vst.sel
: Variance stabilizing transformation of raw counts of selected genes or miRNAs
normalized
: FPKM values of all genes or normalized counts for all miRNAs
## [1] "raw.all" "raw.sel" "log.all" "log.sel" "vst.sel"
## [6] "normalized"
So for each of the five cancer types, we have the above six feature types (raw.all
, raw.sel
, log.all
, log.sel
, vst.sel
and normalized
). For each feature type, we have three “views”: gene expression (named fpkm
, i.e., names(Wall[[1]][[1]])[1]
), miRNA expression (mirnas
), and DNA methylation beta values (methy450
). In total, there are 90 matrices contained in Wall
. (For DNA methylation, we directly used beta values without feature transformation. So the six methy450
matrices are the same. Thus we actually only have 65 unique matrices in Wall
.)
We can perform spectral clustering on a patient affinity matrix. Take adrend_gland cancer for example. We can cluseter patients using affinity matrix derived from log2 transformation of raw counts of differentially expressed genes.
library(ANF)
affinity.mat <- Wall[["adrenal_gland"]][["log.sel"]][["fpkm"]]
labels <- spectral_clustering(affinity.mat, k = 2)
Since we know true disease types, which correspond to project ids in project_ids
, we can calculate NMI and ARI.
true.disease.types <- as.factor(project_ids[rownames(affinity.mat)])
print(table(labels, true.disease.types))
## true.disease.types
## labels TCGA-ACC TCGA-PCPG
## 1 0 176
## 2 76 1
nmi <- igraph::compare(true.disease.types, labels, method = "nmi")
adjusted.rand = igraph::compare(true.disease.types, labels, method = "adjusted.rand")
# we can also calculate p-value using `surv.plot` data
surv.plot <- surv.plot[rownames(affinity.mat), ]
f <- survival::Surv(surv.plot$time, !surv.plot$censored)
fit <- survival::survdiff(f ~ labels)
pval <- stats::pchisq(fit$chisq, df = length(fit$n) - 1, lower.tail = FALSE)
print(paste("NMI =", nmi, ", ARI =", adjusted.rand, ", p-val =", pval))
## [1] "NMI = 0.962881312440823 , ARI = 0.983811442595234 , p-val = 1.46669868477277e-07"
In ANF package, We have provided a function eval_clu
that streamlines the above process from spectral clustering to calculating NMI, ARI and p-value. Here is an example of how to use eval_clu
:
## nmi = 0.962881312440823
## adjusted.rand = 0.983811442595234
## survdiff p value = 1.46669868477277e-07
## labels
## true_class 1 2
## TCGA-ACC 0 76
## TCGA-PCPG 176 1
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1
## [75] 1 1 1 1 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [186] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [223] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
For adrenal_gland cancer, we only misclassify one out of 253 patients using this affinity matrix. That is a pretty good result (In fact, this the best result we can achieve. Users can try using other matrices and compare the results). However, for many cases, using a single affinity matrix does a “terrible” job in clustering patients into correct disease types. Take uterus cancer for example (the NMI is near 0).
## nmi = 0.0563331568619408
## adjusted.rand = -0.0550627074255129
## survdiff p value = NA
## labels
## true_class 1 2
## TCGA-UCEC 153 268
## TCGA-UCS 3 51
Instead of using one affinity matrix, we can “fuse” multiple affinity matrices using ANF package, and then perform spectral clustering on the fused affinity matrix.
Let’s take uterus cancer for example.
# fuse three matrices: "fpkm" (gene expression), "mirnas" (miRNA expression) and "methy450" (DNA methylation)
fused.mat <- ANF(Wall = Wall$uterus$raw.all)
# Spectral clustering on fused patient affinity matrix
labels <- spectral_clustering(A = fused.mat, k = 2)
# Or we can directly evaluate clustering results using function `eval_clu`, which calls `spectral_clustering` and calculate NMI and ARI (and p-value if patient survival data is available. `surv.plot` does not contain information for uterus cancer patients)
res <- eval_clu(true_class = project_ids[rownames(fused.mat)], w = fused.mat)
## nmi = 0.48526760996706
## adjusted.rand = 0.684204025562809
## survdiff p value = NA
## labels
## true_class 1 2
## TCGA-UCEC 410 11
## TCGA-UCS 14 40
As we can see, spectral clustering on the fused affinity matrix significantly improves the results for uterus cancer. This demonstrate the power of ANF. The paper https://arxiv.org/abs/1708.07136 have provided more results.