The CellMixS package is a toolbox to explore and compare group effects in single-cell RNA-seq data. It has two major applications:
For this purpose it introduces two new metrics:
Besides this, several exploratory plotting functions enable evaluation of key integration and mixing features.
CellMixS can be installed from Bioconductor as following.
if (!requireNamespace("BiocManager"))
install.packages("BiocManager")
BiocManager::install("CellMixS")
After installation the package can be loaded into R.
library(CellMixS)
CellMixS uses the SingleCellExperiment
class from the SingleCellExperiment Bioconductor
package as the format for input data.
The package contains example data named sim_50, a list of simulated single-cell RNA-seq data with varying batch effect strength and unbalanced batch sizes.
Batch effects were introduced by sampling 0%, 20% or 50% of gene expression values from a distribution with variant mean (e.g. 0% - 50% of genes were affected by a batch effect).
All datasets consist of 3 batches, one with 300 cells and the others with half of its size (so 150 cells). The simulation is modified after (Büttner et al. 2019) and described in sim50.
# load required packages
suppressPackageStartupMessages({
library(SingleCellExperiment)
library(cowplot)
library(limma)
library(magrittr)
library(dplyr)
library(purrr)
library(ggplot2)
})
# load sim_list example data
sim_list <- readRDS(system.file(file.path("extdata", "sim50.rds"),
package = "CellMixS"))
names(sim_list)
#> [1] "batch0" "batch20" "batch50"
sce50 <- sim_list[["batch50"]]
class(sce50)
#> [1] "SingleCellExperiment"
#> attr(,"package")
#> [1] "SingleCellExperiment"
table(sce50[["batch"]])
#>
#> 1 2 3
#> 300 150 150
Often batch effects can already be detected by visual inspection and simple
visualization (e.g. in a normal tSNE or UMAP plot) depending on the strength. CellMixS has different plotting functions to
visualize group label and mixing scores aside without the need for using
different packages. Results are ggplot
objects and can be further customized
using ggplot2. Other packages, such as scater, provide similar plotting functions and could be
used as well.
#visualize batch distribution in sce50
visGroup(sce50, group = "batch")
#visualize batch distribution in other elements of sim_list
batch_names <- c("batch0", "batch20")
vis_batch <- lapply(batch_names, function(name){
sce <- sim_list[[name]]
visGroup(sce, "batch") + ggtitle(paste0("sim_", name))
})
plot_grid(plotlist = vis_batch, ncol = 2)
Not all batch effects or group differences are obvious using visualization. Also, in single-cell experiments celltypes and cells can be affected in different ways by experimental conditions causing batch effects (e.g. some cells are more robust to storing conditions than others).
Furthermore the range of methods for data integration and batch effect removal gives rise to the question of which method performs best on which data, and thereby a need to quantify batch effects.
The cellspecific mixing score cms
tests for each cell the hypothesis that batch-specific distance distributions towards it’s k-nearest neighbouring (knn)
cells are derived from the same unspecified underlying distribution using the Anderson-Darling test (Scholz and Stephens 1987). Results from the cms
function are two
scores cms and cms_smooth, where the latter is the weighted mean within
each cell’s neighbourhood. They can be interpreted as the probability that
batches within this cell’s neighbourhood are equally mixed. A high cms
score
refers to good mixing, while a low score indicates batch-specific bias.
The test considers differences in the number of cells from each batch.
This facilitates the cms
score to differentiate between unbalanced batches
(e.g. one celltype being more abundant in a specific batch) and a biased
distribution of cells. The cms
function takes an SingleCellExperiment
object (described in SingleCellExperiment) as input
and appends to the colData slot.
#call cell-specific mixing score for sce50
# Note cell_min is set to 4 due to the low number of cells and small k.
# Usually default params are recommended!!
sce50 <- cms(sce50, k = 30, group = "batch", res_name = "unaligned", n_dim = 3,
cell_min = 4)
head(colData(sce50))
#> DataFrame with 6 rows and 3 columns
#> batch cms_smooth.unaligned cms.unaligned
#> <factor> <numeric> <numeric>
#> cell_1 1 0 0
#> cell_2 1 0 0
#> cell_3 1 0.254180594918236 0.31308
#> cell_4 1 0.0473784255600537 0
#> cell_5 1 0 0
#> cell_6 1 0.0630236882520772 0
#call cell-specific mixing score for all
sim_list <- lapply(batch_names, function(name){
sce <- sim_list[[name]]
sce <- cms(sce, k = 30, group = "batch", res_name = "unaligned", n_dim = 3,
cell_min = 4)
}) %>% set_names(batch_names)
#append cms50
sim_list[["batch50"]] <- sce50
The most important parameter to set to calculate cms
is k
, the number of
knn cells to use in the Anderson-Darling test. The optimal choice depends on
the application, as with a small k
focus is on local mixing, while with a
large k
mixing with regard to more global structures is evaluated. If the
overall dataset structure is very heterogeneous with large differences in the
number of cells per celltypes, it might be useful to adapt the number of knn.
This can be done by setting the k_min
parameter to the minimum number of knn
cells to include. Before performing the hypothesis test, cms
will look for
local minima in the overall distance distribution of it’s knn cells. Only
cells within a distance smaller than the first local minimum are then included,
but at least k_min
cells. This can e.g. ensure that only cells from the same
celltype cluster are included without relying on previous clustering algorithms.
For smoothing either k
or if specified k_min
cells are included to get a
weighted mean of cms
-scores. Weights are defined by the reciprocal distance
towards the respective cell, with 1 as weight of the respective cell itself.
Another important parameter is the subspace to use to calculate cell distances.
This can be set using the dim_red
parameter. By default PCA subspace will be
used and calculated if not present. Some data integration methods provide
embeddings of a common subspace instead of “corrected counts”. cms
scores
can be calculated within these by defining them in dim_red
(see 6.1).
In general all reduced dimensional representations can be specified, but only
PCA will be computed automatically, while other methods need to be
precomputed.
An overall summary of cms
can be visualized as histogram. As cms
score are
p.values from hypothesis testing, without any batch effect the p.value
histogram should be flat. An increased number of very small p-values
indicates the presence of a batch-specific bias within data.
#pval hist of cms50
visHist(sce50)
#pval hist sim30
#combine cms results in one matrix
batch_names <- names(sim_list)
cms_mat <- batch_names %>%
map(function(name) sim_list[[name]]$cms.unaligned) %>%
bind_cols() %>% set_colnames(batch_names)
visHist(cms_mat, n_col = 3)
Results of cms
can be visualized in a cell-specific way and alongside any
metadata.
#cms only cms10
sce20 <- sim_list[["batch20"]]
metric_plot <- visMetric(sce20, metric_var = "cms_smooth.unaligned")
#group only cms10
group_plot <- visGroup(sce20, group = "batch")
plot_grid(metric_plot, group_plot, ncol = 2)
#add random celltype assignments as new metadata
sce20[["celltype"]] <- rep(c("CD4+", "CD8+", "CD3"), ncol(sce20)/3) %>%
as.factor
visOverview(sce20, "batch", other_var = "celltype")
Systematic differences (e.g. celltype differences) can be further explored using
visCluster
. Here we do not expect any systematic difference as celltypes were
randomly assigned.
visCluster(sce20, metric_var = "cms.unaligned",
cluster_var = "celltype")
#> Picking joint bandwidth of 0.103
To remove batch effects when integrating different single-cell RNAseq datasets,
a range of methods can be used. The cms
function can be used to evaluate the
effect of these methods, using a cell-specific mixing score. Some of them
(e.g. fastMNN
from the scran package) provide a
“common subspace” with integrated embeddings. Other methods like limma give “batch-corrected data” as results. Both work
as input for cms
.
#MNN - embeddings are stored in the reducedDims slot of sce
reducedDimNames(sce20)
#> [1] "TSNE" "PCA" "MNN"
sce20 <- cms(sce20, k = 30, group = "batch",
dim_red = "MNN", res_name = "MNN", n_dim = 3, cell_min = 4)
# run limma
limma_corrected <- removeBatchEffect(counts(sce20), batch = sce20$batch)
#add corrected counts to sce
assay(sce20, "lim_corrected") <- limma_corrected
#run cms
sce20 <- cms(sce20, k = 30, group = "batch",
assay_name = "lim_corrected", res_name = "limma", n_dim = 3,
cell_min = 4)
names(colData(sce20))
#> [1] "batch" "cms_smooth.unaligned" "cms.unaligned"
#> [4] "celltype" "cms_smooth.MNN" "cms.MNN"
#> [7] "cms_smooth.limma" "cms.limma"
To compare different methods summary plots from visIntegration
(see 6.4) and p-value histograms from visHist
can be used. Local
patterns within single methods can be explored as described above.
# As pvalue histograms
visHist(sce20, metric_prefix = "cms.", n_col = 3)
Here both methods limma and fastMNN
from the scran package flattened the p.value distribution.
So cells are better mixed after batch effect removal.
Besides successful batch “mixing” data integration should also preserve the data’s internal structure and variability without adding new sources of variability or removing underlying structures. Especially for methods that result in “corrected counts” it is important to understand how much of the dataset internal structures are preserved.
ldfDiff
calculates the differences between each cell’s
local density factor before and after data integration (Latecki, Lazarevic, and Pokrajac 2007).
The local density factor is a relative measure of the cell density around a cell
compared to the densities within it’s neighbourhood. Local density factors are
calculated on the same set of k cells from the cell’s kNN before integration.
In an optimal case relative densities (according to the same set of cells)
should not change by integration and the ldfDiff
score should be close to 0.
In general the overall distribution of ldfDiff
should be centered around 0
without long tails.
#Prepare input
# list with single SingleCellExperiment objects
#(Important: List names need to correspond to batch levels! See ?ldfDiff)
sce_pre_list <- list("1" = sce20[,sce20$batch == "1"],
"2" = sce20[,sce20$batch == "2"],
"3" = sce20[,sce20$batch == "3"])
sce20 <- ldfDiff(sce_pre_list, sce_combined = sce20,
group = "batch", k = 70, dim_red = "PCA",
dim_combined = "MNN", assay_pre = "counts",
n_dim = 3, res_name = "MNN")
sce20 <- ldfDiff(sce_pre_list, sce_combined = sce20,
group = "batch", k = 70, dim_red = "PCA",
dim_combined = "PCA", assay_pre = "counts",
assay_combined = "lim_corrected",
n_dim = 3, res_name = "limma")
names(colData(sce20))
#> [1] "batch" "cms_smooth.unaligned" "cms.unaligned"
#> [4] "celltype" "cms_smooth.MNN" "cms.MNN"
#> [7] "cms_smooth.limma" "cms.limma" "diff_ldf.MNN"
#> [10] "diff_ldf.limma"
Results from ldfDiff
can be visualized in a similar way as results from cms
.
#ldfDiff score summarized
visIntegration(sce20, metric_prefix = "diff_ldf")
#> Picking joint bandwidth of 0.0768
ldfDiff
shows a clear difference between both methods.
While limma is able to preserve the batch internal
structure within batches, fastMNN
clearly changes it.
Even if batches are well mixed (see 6.2), fastMNN
does not work
for batch effect removal on these simulated data.
Again this is in line with expectations as due to the small number of genes in
the example data. One of MNN’s assumptions is that batch effects should be much
smaller than biological variation, which does not hold true in this small
example dataset.
sessionInfo()
#> R version 3.6.1 (2019-07-05)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.2 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.9-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.9-bioc/R/lib/libRlapack.so
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] parallel stats4 stats graphics grDevices utils datasets
#> [8] methods base
#>
#> other attached packages:
#> [1] ggplot2_3.2.0 purrr_0.3.2
#> [3] dplyr_0.8.3 magrittr_1.5
#> [5] limma_3.40.4 cowplot_1.0.0
#> [7] SingleCellExperiment_1.6.0 SummarizedExperiment_1.14.0
#> [9] DelayedArray_0.10.0 BiocParallel_1.18.0
#> [11] matrixStats_0.54.0 Biobase_2.44.0
#> [13] GenomicRanges_1.36.0 GenomeInfoDb_1.20.0
#> [15] IRanges_2.18.1 S4Vectors_0.22.0
#> [17] BiocGenerics_0.30.0 CellMixS_1.0.2
#> [19] kSamples_1.2-9 SuppDists_1.1-9.4
#> [21] BiocStyle_2.12.0
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_1.0.1 rsvd_1.0.1
#> [3] lattice_0.20-38 tidyr_0.8.3
#> [5] assertthat_0.2.1 digest_0.6.20
#> [7] R6_2.4.0 plyr_1.8.4
#> [9] ggridges_0.5.1 listarrays_0.2.0
#> [11] evaluate_0.14 pillar_1.4.2
#> [13] zlibbioc_1.30.0 rlang_0.4.0
#> [15] lazyeval_0.2.2 irlba_2.3.3
#> [17] Matrix_1.2-17 rmarkdown_1.14
#> [19] labeling_0.3 BiocNeighbors_1.2.0
#> [21] stringr_1.4.0 RCurl_1.95-4.12
#> [23] munsell_0.5.0 vipor_0.4.5
#> [25] compiler_3.6.1 BiocSingular_1.0.0
#> [27] xfun_0.8 pkgconfig_2.0.2
#> [29] ggbeeswarm_0.6.0 htmltools_0.3.6
#> [31] tidyselect_0.2.5 gridExtra_2.3
#> [33] tibble_2.1.3 GenomeInfoDbData_1.2.1
#> [35] bookdown_0.12 viridisLite_0.3.0
#> [37] withr_2.1.2 crayon_1.3.4
#> [39] bitops_1.0-6 grid_3.6.1
#> [41] gtable_0.3.0 scales_1.0.0
#> [43] stringi_1.4.3 XVector_0.24.0
#> [45] viridis_0.5.1 scater_1.12.2
#> [47] DelayedMatrixStats_1.6.0 tools_3.6.1
#> [49] glue_1.3.1 beeswarm_0.2.3
#> [51] yaml_2.2.0 colorspace_1.4-1
#> [53] BiocManager_1.30.4 knitr_1.23
Büttner, Maren, Zhichao Miao, F. Alexander Wolf, Sarah A. Teichmann, and Fabian J. Theis. 2019. “A test metric for assessing single-cell RNA-seq batch correction.” Nat. Methods 16 (1). Nature Publishing Group:43–49. https://doi.org/10.1038/s41592-018-0254-1.
Latecki, Longin Jan, Aleksandar Lazarevic, and Dragoljub Pokrajac. 2007. “Outlier Detection with Kernel Density Functions.” In Mach. Learn. Data Min. Pattern Recognit., 61–75. Berlin, Heidelberg: Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-540-73499-4_6.
Scholz, F. W., and M. A. Stephens. 1987. “K-Sample Anderson-Darling Tests.” J. Am. Stat. Assoc. 82 (399):918. https://doi.org/10.2307/2288805.