1 Introduction

In this vignette, we provide an overview of the basic functionality and usage of the scds package, which interfaces with SingleCellExperiment objects.

2 Installation

Install the scds package using Bioconductor:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("scds", version = "3.9")

Or from github:

library(devtools)
devtools::install_github('kostkalab/scds')

3 Quick start

scds takes as input a SingleCellExperiment object (see here SingleCellExperiment), where raw counts are stored in a counts assay, i.e. assay(sce,"counts"). An example dataset created by sub-sampling the cell-hashing cell-lines data set (see https://satijalab.org/seurat/hashing_vignette.html) is included with the package and accessible via data("sce").Note that scds is designed to workd with larger datasets, but for the purposes of this vignette, we work with a smaller example dataset. We apply scds to this data and compare/visualize reasults:

3.1 Example data set

Get example data set provided with the package.

library(scds)
library(scater)
library(rsvd)
library(Rtsne)
library(cowplot)
set.seed(30519)
data("sce_chcl")
sce = sce_chcl #- less typing
dim(sce)

## [1] 2000 2000

We see it contains 2,000 genes and 2,000 cells, 216 of which are identified as doublets:

table(sce$hto_classification_global)

## 
##  Doublet Negative  Singlet 
##      216       83     1701

We can visualize cells/doublets after projecting into two dimensions:

logcounts(sce) = log1p(counts(sce))
vrs            = apply(logcounts(sce),1,var)
pc             = rpca(t(logcounts(sce)[order(vrs,decreasing=TRUE)[1:100],]))
ts             = Rtsne(pc$x[,1:10],verb=FALSE)

reducedDim(sce,"tsne") = ts$Y; rm(ts,vrs,pc)
plotReducedDim(sce,"tsne",col="hto_classification_global")

3.2 Computational doublet annotation

We now run the scds doublet annotation approaches. Briefly, we identify doublets in two complementary ways: cxds is based on co-expression of gene pairs and works with absence/presence calls only, while bcds uses the full count information and a binary classification approach using artificially generated doublets. cxds_bcds_hybrid combines both approaches, for more details please consult (this manuscript). Each of the three methods returns a doublet score, with higher scores indicating more “doublet-like” barcodes.

#- Annotate doublet using co-expression based doublet scoring:
sce = cxds(sce,retRes = TRUE)
sce = bcds(sce,retRes = TRUE,verb=TRUE)

## [1]  train-error:0.154125+0.007228   test-error:0.216500+0.005557 
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 2 rounds.
## 
## [2]  train-error:0.140375+0.001816   test-error:0.201500+0.004430 
## [3]  train-error:0.119938+0.004390   test-error:0.190250+0.009533 
## [4]  train-error:0.112688+0.003110   test-error:0.183000+0.006964 
## [5]  train-error:0.102875+0.003684   test-error:0.175000+0.007542 
## [6]  train-error:0.095625+0.003325   test-error:0.176000+0.008958 
## [7]  train-error:0.087875+0.004939   test-error:0.174250+0.012811 
## [8]  train-error:0.084438+0.003652   test-error:0.173000+0.008573 
## [9]  train-error:0.079250+0.003421   test-error:0.171250+0.007027 
## [10] train-error:0.073250+0.002707   test-error:0.172500+0.006937 
## [11] train-error:0.067937+0.002489   test-error:0.171250+0.004472 
## Stopping. Best iteration:
## [9]  train-error:0.079250+0.003421   test-error:0.171250+0.007027
## 
## [1]  train-error:0.162500 
## Will train until train_error hasn't improved in 2 rounds.
## 
## [2]  train-error:0.137500 
## [3]  train-error:0.118250 
## [4]  train-error:0.106750 
## [5]  train-error:0.103500

sce = cxds_bcds_hybrid(sce)
par(mfcol=c(1,3))
boxplot(sce$cxds_score   ~ sce$doublet_true_labels, main="cxds")
boxplot(sce$bcds_score   ~ sce$doublet_true_labels, main="bcds")
boxplot(sce$hybrid_score ~ sce$doublet_true_labels, main="hybrid")

3.3 Visualizing gene pairs

For cxds we can identify and visualize gene pairs driving doublet annoataions, with the expectation that the two genes in a pair might mark different types of cells (see manuscript). In the following we look at the top three pairs, each gene pair is a row in the plot below:

top3 = metadata(sce)$cxds_topPairs[1:3,]
rs   = rownames(sce)
hb   = rowData(sce)$cxds_hvg_bool
ho   = rowData(sce)$cxds_hvg_ordr[hb]
hgs  = rs[ho]

l1 =  ggdraw() + draw_text("Pair 1", x = 0.5, y = 0.5)
p1 = plotReducedDim(sce,"tsne",col=hgs[top3[1,1]])
p2 = plotReducedDim(sce,"tsne",col=hgs[top3[1,2]])

l2 =  ggdraw() + draw_text("Pair 2", x = 0.5, y = 0.5)
p3 = plotReducedDim(sce,"tsne",col=hgs[top3[2,1]])
p4 = plotReducedDim(sce,"tsne",col=hgs[top3[2,2]])

l3 = ggdraw() + draw_text("Pair 3", x = 0.5, y = 0.5)
p5 = plotReducedDim(sce,"tsne",col=hgs[top3[3,1]])
p6 = plotReducedDim(sce,"tsne",col=hgs[top3[3,2]])

plot_grid(l1,p1,p2,l2,p3,p4,l3,p5,p6,ncol=3, rel_widths = c(1,2,2))

4 Session Info

sessionInfo()

## R version 3.6.0 (2019-04-26)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.9-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.9-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] cowplot_0.9.4               Rtsne_0.15                 
##  [3] rsvd_1.0.0                  scater_1.12.0              
##  [5] ggplot2_3.1.1               SingleCellExperiment_1.6.0 
##  [7] SummarizedExperiment_1.14.0 DelayedArray_0.10.0        
##  [9] BiocParallel_1.18.0         matrixStats_0.54.0         
## [11] Biobase_2.44.0              GenomicRanges_1.36.0       
## [13] GenomeInfoDb_1.20.0         IRanges_2.18.0             
## [15] S4Vectors_0.22.0            BiocGenerics_0.30.0        
## [17] scds_1.0.0                  BiocStyle_2.12.0           
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.1               lattice_0.20-38         
##  [3] assertthat_0.2.1         digest_0.6.18           
##  [5] R6_2.4.0                 plyr_1.8.4              
##  [7] evaluate_0.13            pillar_1.3.1            
##  [9] zlibbioc_1.30.0          rlang_0.3.4             
## [11] lazyeval_0.2.2           data.table_1.12.2       
## [13] irlba_2.3.3              Matrix_1.2-17           
## [15] rmarkdown_1.12           labeling_0.3            
## [17] BiocNeighbors_1.2.0      stringr_1.4.0           
## [19] RCurl_1.95-4.12          munsell_0.5.0           
## [21] compiler_3.6.0           vipor_0.4.5             
## [23] BiocSingular_1.0.0       xfun_0.6                
## [25] pkgconfig_2.0.2          ggbeeswarm_0.6.0        
## [27] htmltools_0.3.6          tidyselect_0.2.5        
## [29] tibble_2.1.1             gridExtra_2.3           
## [31] GenomeInfoDbData_1.2.1   bookdown_0.9            
## [33] viridisLite_0.3.0        crayon_1.3.4            
## [35] dplyr_0.8.0.1            withr_2.1.2             
## [37] bitops_1.0-6             grid_3.6.0              
## [39] gtable_0.3.0             magrittr_1.5            
## [41] scales_1.0.0             stringi_1.4.3           
## [43] XVector_0.24.0           viridis_0.5.1           
## [45] DelayedMatrixStats_1.6.0 xgboost_0.82.1          
## [47] tools_3.6.0              glue_1.3.1              
## [49] beeswarm_0.2.3           purrr_0.3.2             
## [51] yaml_2.2.0               colorspace_1.4-1        
## [53] BiocManager_1.30.4       knitr_1.22

scds:single cell doublet scoring: In-silico doublet annotation for single cell RNA sequencing data

2 May 2019

Contents