bluster 1.6.0
The bluster package provides a few diagnostics for quantitatively examining the cluster output.
These diagnostics
We will demonstrate on another dataset from the scRNAseq package,
clustered with graph-based methods via the clusterRows()
generic as described in the previous vignette.
library(scRNAseq)
sce <- GrunPancreasData()
# Quality control to remove bad cells.
library(scuttle)
qcstats <- perCellQCMetrics(sce)
qcfilter <- quickPerCellQC(qcstats, sub.fields="altexps_ERCC_percent")
sce <- sce[,!qcfilter$discard]
# Normalization by library size.
sce <- logNormCounts(sce)
# Feature selection.
library(scran)
dec <- modelGeneVar(sce)
hvgs <- getTopHVGs(dec, n=1000)
# Dimensionality reduction.
set.seed(1000)
library(scater)
sce <- runPCA(sce, ncomponents=20, subset_row=hvgs)
# Clustering.
library(bluster)
mat <- reducedDim(sce)
clust.info <- clusterRows(mat, NNGraphParam(), full=TRUE)
clusters <- clust.info$clusters
table(clusters)
## clusters
## 1 2 3 4 5 6 7 8 9 10 11 12
## 285 171 161 59 174 49 70 137 69 65 28 23
The silhouette width is a standard metric to quantify the separation between clusters generated by any procedure. A cell with a large positive width is closer to other cells from the same cluster compared to cells from different clusters. On the other hand, low or negative widths indicate that cells from different clusters are not well separated.
The exact silhouette calculation is rather computationally intensive so bluster implements an approximation instead.
This is provided in the approxSilhouette()
function, which returns the width for each cell and its closest (non-self) cluster.
Clusters consisting of cells with lower widths may warrant some more care during interpretation.
sil <- approxSilhouette(mat, clusters)
sil
## DataFrame with 1291 rows and 3 columns
## cluster other width
## <factor> <factor> <numeric>
## D2ex_1 1 7 0.375510
## D2ex_2 1 7 0.370596
## D2ex_3 1 7 0.409271
## D2ex_4 5 3 0.257069
## D2ex_5 5 3 0.249884
## ... ... ... ...
## D17TGFB_91 6 2 0.126608
## D17TGFB_92 1 5 0.403107
## D17TGFB_93 1 7 0.384364
## D17TGFB_94 6 2 0.243933
## D17TGFB_95 6 2 0.204860
boxplot(split(sil$width, clusters))
The function also returns the identity of the closest “other” cluster for each cell. This can be helpful to identify which clusters are easily confused to each other, based on how many of one cluster’s cells are closer to the other cluster.
best.choice <- ifelse(sil$width > 0, clusters, sil$other)
table(Assigned=clusters, Closest=best.choice)
## Closest
## Assigned 1 2 3 4 5 6 7 8 9 10 11 12
## 1 275 0 7 0 3 0 0 0 0 0 0 0
## 2 0 171 0 0 0 0 0 0 0 0 0 0
## 3 0 0 150 0 11 0 0 0 0 0 0 0
## 4 0 0 0 59 0 0 0 0 0 0 0 0
## 5 0 0 21 0 153 0 0 0 0 0 0 0
## 6 0 26 0 0 0 23 0 0 0 0 0 0
## 7 14 0 8 10 2 0 18 0 7 2 8 1
## 8 0 0 0 1 0 0 0 136 0 0 0 0
## 9 0 0 0 0 0 0 0 10 58 0 0 1
## 10 0 0 0 0 0 0 0 0 0 65 0 0
## 11 0 0 0 0 0 0 0 0 0 0 28 0
## 12 0 0 0 0 0 0 0 0 0 0 0 23
Another diagnostic uses the percentage of neighbors for each cell that belong to the same cluster. Well-separated clusters should exhibit high percentages (i.e., “purities”) as cells from different clusters do not mix. Low purities are symptomatic of overclustering where cluster boundaries become more ambiguous.
The neighborPurity()
function computes the purity of the neighborhood for each cell.
Clusters with systematically low purities may warrant some more care during interpretation.
By default, we perform some weighting so that large clusters do not have large purities simply because there are few cells assigned to other clusters in the dataset.
pure <- neighborPurity(mat, clusters)
pure
## DataFrame with 1291 rows and 2 columns
## purity maximum
## <numeric> <factor>
## D2ex_1 1.000000 1
## D2ex_2 1.000000 1
## D2ex_3 1.000000 1
## D2ex_4 0.951603 5
## D2ex_5 1.000000 5
## ... ... ...
## D17TGFB_91 0.959277 6
## D17TGFB_92 1.000000 1
## D17TGFB_93 1.000000 1
## D17TGFB_94 0.986538 6
## D17TGFB_95 0.898817 6
boxplot(split(pure$purity, clusters))
The function also returns the identity of the other cluster with the highest percentage. This can again be useful to identify the relationships between clusters based on which pairs have the greatest intermingling in their neighborhoods.
table(Assigned=clusters, Max=pure$maximum)
## Max
## Assigned 1 2 3 4 5 6 7 8 9 10 11 12
## 1 279 0 0 0 0 0 6 0 0 0 0 0
## 2 0 170 0 0 0 0 0 0 0 0 0 1
## 3 0 0 158 0 3 0 0 0 0 0 0 0
## 4 0 0 0 59 0 0 0 0 0 0 0 0
## 5 1 0 6 0 165 0 2 0 0 0 0 0
## 6 0 18 0 0 0 31 0 0 0 0 0 0
## 7 0 0 0 1 0 0 67 0 1 0 1 0
## 8 0 0 0 0 0 0 0 136 1 0 0 0
## 9 0 0 0 0 0 0 0 5 64 0 0 0
## 10 0 0 0 0 0 0 0 0 0 65 0 0
## 11 0 0 0 0 0 0 0 0 0 0 28 0
## 12 0 0 0 0 0 0 0 0 0 0 0 23
The root-mean-squared deviation (RMSD) for each cluster represents the dispersion of cells within each cluster. A large RMSD value indicates that a cluster has high internal heterogeneity, making it a good candidate for further subclustering.
rmsd <- clusterRMSD(mat, clusters)
barplot(rmsd)
Alternatively, we can compute the within-cluster sum of squares (WCSS), a metric commonly seen in \(k\)-means clustering. One could pick a “sensible” choice for \(k\) by computing the WCSS for a range of values and picking the value the WCSS begins to plateau - see Section 8 for more details.
clusterRMSD(mat, clusters, sum=TRUE)
## 1 2 3 4 5 6 7 8
## 60368.807 9537.386 26566.192 19009.470 30701.291 8244.528 31829.154 14682.105
## 9 10 11 12
## 10652.129 6094.776 2644.849 1956.404
For graph-based methods, we can compute the cluster modularity within clusters and between pairs of clusters.
Specifically, we examine the ratio of observed to expected edge weights for each pair of clusters (closely related to the modularity score used in many cluster_*
functions from igraph).
We would usually expect to see high observed weights between cells in the same cluster with minimal weights between clusters, indicating that the clusters are well-separated.
Large off-diagonal entries indicate that the corresponding pair of clusters are closely related.
g <- clust.info$objects$graph
ratio <- pairwiseModularity(g, clusters, as.ratio=TRUE)
library(pheatmap)
pheatmap(log10(ratio+1), cluster_cols=FALSE, cluster_rows=FALSE,
col=rev(heat.colors(100)))
This may be better visualized with a force-directed layout:
cluster.gr <- igraph::graph_from_adjacency_matrix(log2(ratio+1),
mode="upper", weighted=TRUE, diag=FALSE)
# Increasing the weight to increase the visibility of the lines.
set.seed(1100101)
plot(cluster.gr, edge.width=igraph::E(cluster.gr)$weight*5,
layout=igraph::layout_with_lgl)
We can also tune the resolution of the clustering post hoc with the mergeCommunities()
function.
This will iteratively merge the most closely related pair of clusters together until the desired number of clusters is reached.
For example, if we wanted to whittle down the number of clusters to 10, we could do:
merged <- mergeCommunities(g, clusters, number=10)
table(merged)
## merged
## 1 2 3 5 6 7 9 10 11 12
## 285 171 161 174 49 129 206 65 28 23
To compare two clusterings, the pairwiseRand()
function computes the adjusted Rand index (ARI).
High ARIs indicate that the two clusterings are similar with respect to how they partition the observations,
and an ARI of 1 means that the clusterings are identical.
hclusters <- clusterRows(mat, HclustParam(cut.dynamic=TRUE))
pairwiseRand(clusters, hclusters, mode="index")
## [1] 0.4512108
Of course, a single number is not particularly useful,
so clusterRand()
also provides the capability to break down the ARI into its contributions from each cluster or cluster pair.
Specifically, for each cluster or cluster pair in a “reference” clustering (here, clusters
),
we see whether it is preserved in the “alternative” clustering (here, hclusters
).
Large values on the diagonal indicate that the reference cluster is recapitulated;
large values off the diagonal indicate that the separation between the corresponding pair of clusters is also maintained.
Conversely, low diagonal values indicate that the corresponding cluster is fragmented in the alternative,
and low off-diagonal values can be used as a diagnostic for loss of separation.
ratio <- pairwiseRand(clusters, hclusters, mode="ratio")
library(pheatmap)
pheatmap(ratio, cluster_cols=FALSE, cluster_rows=FALSE,
col=viridis::viridis(100), breaks=seq(-1, 1, length=101))
Explicit mappings between two clusterings can be performed using linkClusters()
(see Section 9).
Alternatively, we can quantify the degree of “nesting” of one clustering within another with nestedClusters()
;
this can be useful for verifying that the higher-resolution clustering is indeed nested within its coarser counterpart.
We can use bootstrapping to evaluate the effect of sampling noise on the stability of a clustering procedure.
The bootstrapStability()
function will return the ARI of the original clusters against those generated from bootstrap replicates,
averaged across multiple bootstrap iterations.
High values indicate that the clustering is robust to sample noise.
set.seed(1001010)
ari <-bootstrapStability(mat, clusters=clusters,
mode="index", BLUSPARAM=NNGraphParam())
ari
## [1] 0.7147913
Advanced users may also set mode="ratio"
to obtain a more detailed breakdown of the effect of noise on each cluster (pair).
set.seed(1001010)
ratio <-bootstrapStability(mat, clusters=clusters,
mode="ratio", BLUSPARAM=NNGraphParam())
library(pheatmap)
pheatmap(ratio, cluster_cols=FALSE, cluster_rows=FALSE,
col=viridis::viridis(100), breaks=seq(-1, 1, length=101))
The clusterSweep()
function provides a convenient way to test multiple combinations of parameter settings.
Given a BlusterParam
object and a set of values for each parameter, the function will repeat the clustering ith each combination of parameters.
The example below uses graph-based clustering with a variety of k
as well as different community detection algorithms.
combinations <- clusterSweep(mat, BLUSPARAM=SNNGraphParam(),
k=c(5L, 10L, 15L, 20L), cluster.fun=c("walktrap", "louvain", "infomap"))
This yields a list containing all clusterings and the corresponding parameter combinations used to generate them. The function will attempt to generate some sensible name for each combination, though this may require some manual curation for large numbers of parameters.
colnames(combinations$clusters)
## [1] "k.5_cluster.fun.walktrap" "k.10_cluster.fun.walktrap"
## [3] "k.15_cluster.fun.walktrap" "k.20_cluster.fun.walktrap"
## [5] "k.5_cluster.fun.louvain" "k.10_cluster.fun.louvain"
## [7] "k.15_cluster.fun.louvain" "k.20_cluster.fun.louvain"
## [9] "k.5_cluster.fun.infomap" "k.10_cluster.fun.infomap"
## [11] "k.15_cluster.fun.infomap" "k.20_cluster.fun.infomap"
combinations$parameters
## DataFrame with 12 rows and 2 columns
## k cluster.fun
## <integer> <character>
## k.5_cluster.fun.walktrap 5 walktrap
## k.10_cluster.fun.walktrap 10 walktrap
## k.15_cluster.fun.walktrap 15 walktrap
## k.20_cluster.fun.walktrap 20 walktrap
## k.5_cluster.fun.louvain 5 louvain
## ... ... ...
## k.20_cluster.fun.louvain 20 louvain
## k.5_cluster.fun.infomap 5 infomap
## k.10_cluster.fun.infomap 10 infomap
## k.15_cluster.fun.infomap 15 infomap
## k.20_cluster.fun.infomap 20 infomap
We can combine this with some of the metrics defined above to quantify cluster separation as a function of the clustering parameters. This allows us to quickly determine which parameters have a noticeable impact on the results. In principle, we could choose the clustering with the greatest separation for further analysis; however, this tends to be disappointing as it often favors overly broad clusters.
set.seed(10)
nclusters <- 3:25
kcombos <- clusterSweep(mat, BLUSPARAM=KmeansParam(centers=5), centers=nclusters)
sil <- vapply(as.list(kcombos$clusters), function(x) mean(approxSilhouette(mat, x)$width), 0)
plot(nclusters, sil, xlab="Number of clusters", ylab="Average silhouette width")
pur <- vapply(as.list(kcombos$clusters), function(x) mean(neighborPurity(mat, x)$purity), 0)
plot(nclusters, pur, xlab="Number of clusters", ylab="Average purity")
wcss <- vapply(as.list(kcombos$clusters), function(x) sum(clusterRMSD(mat, x, sum=TRUE)), 0)
plot(nclusters, wcss, xlab="Number of clusters", ylab="Within-cluster sum of squares")
If we have many clusterings, we can identify corresponding clusters with the linkClusters()
function.
This constructs a graph where edges are formed between pairs of clusters from different clusterings, based on the number of cells assigned to both clusters.
Re-using some of the clusterings from our previous sweep, we might do:
linked <- linkClusters(
list(
walktrap=combinations$clusters$k.10_cluster.fun.walktrap,
louvain=combinations$clusters$k.10_cluster.fun.louvain,
infomap=combinations$clusters$k.10_cluster.fun.infomap
)
)
linked
## IGRAPH baee0ce UNW- 44 81 --
## + attr: name (v/c), weight (e/n)
## + edges from baee0ce (vertex names):
## [1] walktrap.1 --louvain.1 walktrap.3 --louvain.1 walktrap.7 --louvain.1
## [4] walktrap.1 --louvain.2 walktrap.3 --louvain.2 walktrap.5 --louvain.2
## [7] walktrap.3 --louvain.3 walktrap.5 --louvain.3 walktrap.7 --louvain.3
## [10] walktrap.11--louvain.3 walktrap.3 --louvain.4 walktrap.4 --louvain.5
## [13] walktrap.5 --louvain.5 walktrap.7 --louvain.5 walktrap.6 --louvain.6
## [16] walktrap.8 --louvain.6 walktrap.10--louvain.7 walktrap.2 --louvain.8
## [19] walktrap.6 --louvain.8 walktrap.3 --louvain.9 walktrap.5 --louvain.9
## [22] walktrap.8 --louvain.10 walktrap.9 --louvain.10 walktrap.12--louvain.11
## + ... omitted several edges
This can be used to visualize the relationships between clusters, or to identify metaclusters across clusterings with community detection algorithms:
meta <- igraph::cluster_walktrap(linked)
plot(linked, mark.groups=meta)
By default, the edge weights are computed by dividing the number of shared cells with the smaller of the total number of cells in either cluster. This favors strong edges between a large cluster in one clustering and smaller subcluster in another (finer) clustering. Alternative weighting schemes will favour a 1:1 mapping between clusterings, which can be easier to interpret.
The compareClusterings()
function will return a symmetric matrix of the ARIs between pairs of different clusterings.
This is helpful for visualizing the relationships between different clusterings, e.g., to see which parameters most contribute to differences between clusterings.
aris <- compareClusterings(combinations$clusters)
g <- igraph::graph.adjacency(aris, mode="undirected", weighted=TRUE)
meta2 <- igraph::cluster_walktrap(g)
plot(g, mark.groups=meta2)
We can also identify groups of clusterings, typically corresponding to parameter combinations that yield more-or-less similar results. This allows us to prune out combinations that are largely redundant prior to downstream analyses.
sessionInfo()
## R version 4.2.0 RC (2022-04-19 r82224)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.15-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.15-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] pheatmap_1.0.12 bluster_1.6.0
## [3] scater_1.24.0 ggplot2_3.3.5
## [5] scran_1.24.0 scuttle_1.6.0
## [7] scRNAseq_2.9.2 SingleCellExperiment_1.18.0
## [9] SummarizedExperiment_1.26.0 Biobase_2.56.0
## [11] GenomicRanges_1.48.0 GenomeInfoDb_1.32.0
## [13] IRanges_2.30.0 S4Vectors_0.34.0
## [15] BiocGenerics_0.42.0 MatrixGenerics_1.8.0
## [17] matrixStats_0.62.0 BiocStyle_2.24.0
##
## loaded via a namespace (and not attached):
## [1] AnnotationHub_3.4.0 BiocFileCache_2.4.0
## [3] igraph_1.3.1 lazyeval_0.2.2
## [5] gmp_0.6-5 BiocParallel_1.30.0
## [7] benchmarkme_1.0.7 digest_0.6.29
## [9] foreach_1.5.2 ensembldb_2.20.0
## [11] htmltools_0.5.2 magick_2.7.3
## [13] viridis_0.6.2 fansi_1.0.3
## [15] magrittr_2.0.3 memoise_2.0.1
## [17] mbkmeans_1.12.0 ScaledMatrix_1.4.0
## [19] doParallel_1.0.17 cluster_2.1.3
## [21] limma_3.52.0 Biostrings_2.64.0
## [23] prettyunits_1.1.1 colorspace_2.0-3
## [25] ggrepel_0.9.1 blob_1.2.3
## [27] rappdirs_0.3.3 xfun_0.30
## [29] dplyr_1.0.8 kohonen_3.0.11
## [31] crayon_1.5.1 RCurl_1.98-1.6
## [33] jsonlite_1.8.0 iterators_1.0.14
## [35] glue_1.6.2 gtable_0.3.0
## [37] zlibbioc_1.42.0 XVector_0.36.0
## [39] DelayedArray_0.22.0 BiocSingular_1.12.0
## [41] apcluster_1.4.9 scales_1.2.0
## [43] DBI_1.1.2 edgeR_3.38.0
## [45] Rcpp_1.0.8.3 viridisLite_0.4.0
## [47] xtable_1.8-4 progress_1.2.2
## [49] dqrng_0.3.0 bit_4.0.4
## [51] rsvd_1.0.5 metapod_1.4.0
## [53] httr_1.4.2 RColorBrewer_1.1-3
## [55] FNN_1.1.3 ellipsis_0.3.2
## [57] ClusterR_1.2.6 farver_2.1.0
## [59] pkgconfig_2.0.3 XML_3.99-0.9
## [61] uwot_0.1.11 sass_0.4.1
## [63] dbplyr_2.1.1 dynamicTreeCut_1.63-1
## [65] locfit_1.5-9.5 utf8_1.2.2
## [67] labeling_0.4.2 tidyselect_1.1.2
## [69] rlang_1.0.2 later_1.3.0
## [71] AnnotationDbi_1.58.0 munsell_0.5.0
## [73] BiocVersion_3.15.2 tools_4.2.0
## [75] cachem_1.0.6 cli_3.3.0
## [77] generics_0.1.2 RSQLite_2.2.12
## [79] ExperimentHub_2.4.0 evaluate_0.15
## [81] stringr_1.4.0 fastmap_1.1.0
## [83] yaml_2.3.5 knitr_1.38
## [85] bit64_4.0.5 purrr_0.3.4
## [87] KEGGREST_1.36.0 AnnotationFilter_1.20.0
## [89] sparseMatrixStats_1.8.0 mime_0.12
## [91] xml2_1.3.3 biomaRt_2.52.0
## [93] compiler_4.2.0 beeswarm_0.4.0
## [95] filelock_1.0.2 curl_4.3.2
## [97] png_0.1-7 interactiveDisplayBase_1.34.0
## [99] tibble_3.1.6 statmod_1.4.36
## [101] bslib_0.3.1 stringi_1.7.6
## [103] highr_0.9 RSpectra_0.16-1
## [105] GenomicFeatures_1.48.0 lattice_0.20-45
## [107] ProtGenerics_1.28.0 Matrix_1.4-1
## [109] vctrs_0.4.1 pillar_1.7.0
## [111] lifecycle_1.0.1 BiocManager_1.30.17
## [113] jquerylib_0.1.4 BiocNeighbors_1.14.0
## [115] cowplot_1.1.1 bitops_1.0-7
## [117] irlba_2.3.5 httpuv_1.6.5
## [119] rtracklayer_1.56.0 R6_2.5.1
## [121] BiocIO_1.6.0 bookdown_0.26
## [123] promises_1.2.0.1 gridExtra_2.3
## [125] vipor_0.4.5 codetools_0.2-18
## [127] benchmarkmeData_1.0.4 gtools_3.9.2
## [129] assertthat_0.2.1 rjson_0.2.21
## [131] withr_2.5.0 GenomicAlignments_1.32.0
## [133] Rsamtools_2.12.0 GenomeInfoDbData_1.2.8
## [135] parallel_4.2.0 hms_1.1.1
## [137] grid_4.2.0 beachmat_2.12.0
## [139] rmarkdown_2.14 DelayedMatrixStats_1.18.0
## [141] shiny_1.7.1 ggbeeswarm_0.6.0
## [143] restfulr_0.0.13