Introduction to scClassifR

Introduction

scClassifR is an R package for cell type prediction on single-cell RNA-sequencing data. Currently, the package supports data in the form of a Seurat object or a SingleCellExperiment object.

More information about the Seurat object can be found here: https://satijalab.org/seurat/
More information about the SingleCellExperiment object can be found here: https://osca.bioconductor.org/

scClassifR provides 2 main features:

- A set of pretrained and robust classifiers for basic immune cells. See the "Included models" section below.
- A user-friendly and fully customizable framework to train new classification models. These models can then be easily saved and reused in the future. Detailed usage of this framework is explained in vignettes 2 and 3.

Installation

The scClassifR package can be directly installed from Bioconductor:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

if (!require(scClassifR))
  BiocManager::install("scClassifR")

For more information, see https://bioconductor.org/install/.
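
The load output shown later in this vignette warns that scClassifR is deprecated and will be removed from Bioconductor version 3.15, so the installation above may not succeed on newer releases. A minimal sketch, using only base R and the utils package, to check whether the package is actually available before continuing (the variable name pkg_available is an illustrative choice):

```r
# TRUE if scClassifR can be loaded from the current library paths, FALSE otherwise
pkg_available <- requireNamespace("scClassifR", quietly = TRUE)

if (pkg_available) {
  # Report the installed version
  message("scClassifR version: ",
          as.character(utils::packageVersion("scClassifR")))
} else {
  message("scClassifR is not installed for this R/Bioconductor version.")
}
```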

Included models

The scClassifR package comes with several pre-trained models to classify cell types.

# load scClassifR into working space
library(scClassifR)
#> Loading required package: Seurat
#> Attaching SeuratObject
#> Loading required package: SingleCellExperiment
#> Loading required package: SummarizedExperiment
#> Loading required package: MatrixGenerics
#> Loading required package: matrixStats
#> 
#> Attaching package: 'MatrixGenerics'
#> The following objects are masked from 'package:matrixStats':
#> 
#>     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
#>     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
#>     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
#>     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
#>     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
#>     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
#>     colWeightedMeans, colWeightedMedians, colWeightedSds,
#>     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
#>     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
#>     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
#>     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
#>     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
#>     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
#>     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
#>     rowWeightedSds, rowWeightedVars
#> Loading required package: GenomicRanges
#> Loading required package: stats4
#> Loading required package: BiocGenerics
#> 
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#> 
#>     Filter, Find, Map, Position, Reduce, anyDuplicated, append,
#>     as.data.frame, basename, cbind, colnames, dirname, do.call,
#>     duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
#>     lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
#>     pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table,
#>     tapply, union, unique, unsplit, which.max, which.min
#> Loading required package: S4Vectors
#> 
#> Attaching package: 'S4Vectors'
#> The following objects are masked from 'package:base':
#> 
#>     I, expand.grid, unname
#> Loading required package: IRanges
#> Loading required package: GenomeInfoDb
#> Loading required package: Biobase
#> Welcome to Bioconductor
#> 
#>     Vignettes contain introductory material; view with
#>     'browseVignettes()'. To cite Bioconductor, see
#>     'citation("Biobase")', and for packages 'citation("pkgname")'.
#> 
#> Attaching package: 'Biobase'
#> The following object is masked from 'package:MatrixGenerics':
#> 
#>     rowMedians
#> The following objects are masked from 'package:matrixStats':
#> 
#>     anyMissing, rowMedians
#> 
#> Attaching package: 'SummarizedExperiment'
#> The following object is masked from 'package:SeuratObject':
#> 
#>     Assays
#> The following object is masked from 'package:Seurat':
#> 
#>     Assays
#> Warning: Package 'scClassifR' is deprecated and will be removed from
#>   Bioconductor version 3.15

The models are stored in the default_models object:

data("default_models")
names(default_models)
#> [1] "B cells" "T cells" "NK"

The default_models object is a named list of classifiers. Each classifier is an instance of the scClassifR S4 class. For example:

default_models[['B cells']]
#> An object of class scClassifR for B cells 
#> * 31 features applied: CD38, CD79B, CD74, CD84, RASGRP2, TCF3, SP140, MEF2C, DERL3, CD37, CD79A, POU2AF1, MVK, CD83, BACH2, LY86, CD86, SDC1, CR2, LRMP, VPREB3, IL2RA, BLK, IRF8, FLI1, MS4A1, CD14, MZB1, PTEN, CD19, MME 
#> * Predicting probability threshold: 0.5 
#> * No parent model

Basic pipeline to identify cell types in a scRNA-seq dataset using scClassifR

Preparing the data

To identify the cell types available in a dataset, we need to load the dataset as a Seurat or SingleCellExperiment object.

For this vignette, we use a small sample dataset that is available as a Seurat object as part of the package.

# load the example dataset
data("tirosh_mel80_example")
tirosh_mel80_example
#> An object of class Seurat 
#> 78 features across 480 samples within 1 assay 
#> Active assay: RNA (78 features, 34 variable features)
#>  1 dimensional reduction calculated: umap

The example dataset already contains the clustering results as part of the metadata. This is not necessary for the classification process.

head(tirosh_mel80_example[[]])
#>                               orig.ident nCount_RNA nFeature_RNA percent.mt
#> Cy80_II_CD45_B07_S883_comb SeuratProject   42.46011            8          0
#> Cy80_II_CD45_C09_S897_comb SeuratProject   74.35907           14          0
#> Cy80_II_CD45_H07_S955_comb SeuratProject   42.45392            8          0
#> Cy80_II_CD45_H09_S957_comb SeuratProject   63.47043           12          0
#> Cy80_II_CD45_B11_S887_comb SeuratProject   47.26798            9          0
#> Cy80_II_CD45_D11_S911_comb SeuratProject   69.12167           13          0
#>                            RNA_snn_res.0.8 seurat_clusters RNA_snn_res.0.5
#> Cy80_II_CD45_B07_S883_comb               4               4               2
#> Cy80_II_CD45_C09_S897_comb               4               4               2
#> Cy80_II_CD45_H07_S955_comb               4               4               2
#> Cy80_II_CD45_H09_S957_comb               4               4               2
#> Cy80_II_CD45_B11_S887_comb               4               4               2
#> Cy80_II_CD45_D11_S911_comb               1               1               1

Cell classification

To launch cell type identification, we simply call the classify_cells function. A detailed description of all parameters can be found through the function’s help page ?classify_cells.

Here we use only 3 classifiers (for B cells, T cells and NK cells) to reduce the computational cost of this vignette. To apply all pretrained classifiers to a dataset, use cell_types = 'all'.

seurat.obj <- classify_cells(classify_obj = tirosh_mel80_example, 
                             seurat_assay = 'RNA', seurat_slot = 'data',
                             cell_types = c('B cells', 'NK', 'T cells'), 
                             path_to_models = 'default')

Parameters

The option cell_types = 'all' tells the function to use all available cell classification models. Alternatively, we can limit the identifiable cell types:
- by specifying: cell_types = c('B cells', 'T cells')
- or by indicating the applicable classifiers using the classifiers option: classifiers = c(default_models[['B cells']], default_models[['T cells']])

The option path_to_models = 'default' automatically uses the pretrained models shipped with the package, without loading them into the current workspace. The same option can instead point to a local model database. For more details, see the vignettes on training your own classifiers.
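
The classifiers option described above can be sketched as a complete call. This is a hedged example rather than verified output: it assumes that scClassifR is installed (it is skipped otherwise) and that classify_cells accepts the classifiers argument together with the same assay and slot arguments used earlier. The names selected and seurat.obj.bt are illustrative.

```r
# Run only when scClassifR is installed; the package is deprecated in Bioconductor
pkg_available <- requireNamespace("scClassifR", quietly = TRUE)

if (pkg_available) {
  library(scClassifR)
  data("default_models")
  data("tirosh_mel80_example")

  # Explicit list of classifier objects instead of cell type names
  selected <- c(default_models[["B cells"]], default_models[["T cells"]])

  # Same call as before, but using `classifiers` instead of `cell_types`
  seurat.obj.bt <- classify_cells(classify_obj = tirosh_mel80_example,
                                  seurat_assay = "RNA", seurat_slot = "data",
                                  classifiers = selected)

  # Metadata columns added by the classification
  print(setdiff(colnames(seurat.obj.bt[[]]),
                colnames(tirosh_mel80_example[[]])))
}
```

Passing classifier objects directly is convenient when the models are already in the workspace, for example after training or modifying them.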

Result interpretation

The classify_cells function returns the input object but with additional columns in the metadata table.

# display the additional metadata fields
seurat.obj[[]][c(50:60), c(8:16)]
#>                                            B_cells_p B_cells_class      NK_p
#> cy80.Cd45.pos.PD1.pos.B09.S45.comb       0.007968287            no 0.4776882
#> cy80.Cd45.pos.Pd1.neg.S366.H06.S366.comb 0.999938756           yes 0.5227294
#> cy80.Cd45.pos.Pd1.neg.S202.A10.S202.comb 0.995998043           yes 0.4193178
#> cy80.Cd45.pos.Pd1.neg.S201.A09.S201.comb 0.998736176           yes 0.4623257
#> cy80.Cd45.pos.Pd1.neg.S221.B05.S221.comb 0.999724425           yes 0.5028934
#> cy80.Cd45.pos.PD1.pos.A03.S15.comb       0.996084449           yes 0.3515119
#> cy80.Cd45.pos.PD1.pos.B11.S47.comb       0.017356301            no 0.4776076
#> cy80.Cd45.pos.PD1.pos.S189.H09.S189.comb 0.112374511            no 0.4697555
#> cy80.Cd45.pos.PD1.pos.A05.S17.comb       0.038732890            no 0.4372554
#> cy80.Cd45.pos.PD1.pos.C02.S62.comb       0.043200589            no 0.4639698
#> cy80.Cd45.pos.PD1.pos.D12.S96.comb       0.989798761           yes 0.4548053
#>                                          NK_class T_cells_p T_cells_class
#> cy80.Cd45.pos.PD1.pos.B09.S45.comb             no 0.9297574           yes
#> cy80.Cd45.pos.Pd1.neg.S366.H06.S366.comb      yes 0.1234063            no
#> cy80.Cd45.pos.Pd1.neg.S202.A10.S202.comb       no 0.1024498            no
#> cy80.Cd45.pos.Pd1.neg.S201.A09.S201.comb       no 0.2495409            no
#> cy80.Cd45.pos.Pd1.neg.S221.B05.S221.comb      yes 0.1393775            no
#> cy80.Cd45.pos.PD1.pos.A03.S15.comb             no 0.2489775            no
#> cy80.Cd45.pos.PD1.pos.B11.S47.comb             no 0.9393191           yes
#> cy80.Cd45.pos.PD1.pos.S189.H09.S189.comb       no 0.9155632           yes
#> cy80.Cd45.pos.PD1.pos.A05.S17.comb             no 0.9615349           yes
#> cy80.Cd45.pos.PD1.pos.C02.S62.comb             no 0.9426831           yes
#> cy80.Cd45.pos.PD1.pos.D12.S96.comb             no 0.9244885           yes
#>                                          predicted_cell_type
#> cy80.Cd45.pos.PD1.pos.B09.S45.comb                   T cells
#> cy80.Cd45.pos.Pd1.neg.S366.H06.S366.comb          B cells/NK
#> cy80.Cd45.pos.Pd1.neg.S202.A10.S202.comb             B cells
#> cy80.Cd45.pos.Pd1.neg.S201.A09.S201.comb             B cells
#> cy80.Cd45.pos.Pd1.neg.S221.B05.S221.comb          B cells/NK
#> cy80.Cd45.pos.PD1.pos.A03.S15.comb                   B cells
#> cy80.Cd45.pos.PD1.pos.B11.S47.comb                   T cells
#> cy80.Cd45.pos.PD1.pos.S189.H09.S189.comb             T cells
#> cy80.Cd45.pos.PD1.pos.A05.S17.comb                   T cells
#> cy80.Cd45.pos.PD1.pos.C02.S62.comb                   T cells
#> cy80.Cd45.pos.PD1.pos.D12.S96.comb           B cells/T cells
#>                                          most_probable_cell_type     clust_pred
#> cy80.Cd45.pos.PD1.pos.B09.S45.comb                       T cells 86.42% T cells
#> cy80.Cd45.pos.Pd1.neg.S366.H06.S366.comb                 B cells   100% B cells
#> cy80.Cd45.pos.Pd1.neg.S202.A10.S202.comb                 B cells   100% B cells
#> cy80.Cd45.pos.Pd1.neg.S201.A09.S201.comb                 B cells   100% B cells
#> cy80.Cd45.pos.Pd1.neg.S221.B05.S221.comb                 B cells   100% B cells
#> cy80.Cd45.pos.PD1.pos.A03.S15.comb                       B cells   100% B cells
#> cy80.Cd45.pos.PD1.pos.B11.S47.comb                       T cells 86.42% T cells
#> cy80.Cd45.pos.PD1.pos.S189.H09.S189.comb                 T cells 86.42% T cells
#> cy80.Cd45.pos.PD1.pos.A05.S17.comb                       T cells 86.42% T cells
#> cy80.Cd45.pos.PD1.pos.C02.S62.comb                       T cells 86.42% T cells
#> cy80.Cd45.pos.PD1.pos.D12.S96.comb                       B cells   100% B cells

New columns are:

- predicted_cell_type: the predicted cell type, including any ambiguous assignments. In these cases, the possible cell types are separated by a "/".
- most_probable_cell_type: the most probable cell type, ignoring any ambiguous assignments.
- columns with the syntax [celltype]_p: the probability that a cell belongs to the given cell type. Unknown cell types are marked as NAs.
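
How these columns relate to each other can be illustrated with a small base R sketch. It reuses the probabilities of three cells from the table above and assumes a threshold of 0.5 for all three classifiers (this vignette only states that value for the B cell classifier). The variable names and the combination logic are illustrative assumptions, not the package's internal implementation.

```r
# Probabilities copied from three rows of the result table above
probs <- data.frame(
  B_cells_p = c(0.007968287, 0.999938756, 0.989798761),
  NK_p      = c(0.4776882,   0.5227294,   0.4548053),
  T_cells_p = c(0.9297574,   0.1234063,   0.9244885),
  row.names = c("cy80.Cd45.pos.PD1.pos.B09.S45.comb",
                "cy80.Cd45.pos.Pd1.neg.S366.H06.S366.comb",
                "cy80.Cd45.pos.PD1.pos.D12.S96.comb")
)
cell_types <- c("B cells", "NK", "T cells")
threshold  <- 0.5  # assumption: same threshold for every classifier

prob_mat <- as.matrix(probs)

# One logical column per classifier: TRUE where the probability reaches the threshold
above <- prob_mat >= threshold

# Ambiguous assignments: join every cell type that passes the threshold with "/"
predicted <- apply(above, 1, function(hit) paste(cell_types[hit], collapse = "/"))

# Single label: the cell type with the highest probability
most_probable <- cell_types[max.col(prob_mat, ties.method = "first")]

print(data.frame(predicted, most_probable))
```

For these three cells the sketch yields the same labels as the predicted_cell_type and most_probable_cell_type columns shown in the table.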

Result visualization

The predicted cell types can now simply be visualized using the matching plotting functions. In this example, we use Seurat’s DimPlot function:

# Visualize the cell types
Seurat::DimPlot(seurat.obj, group.by = "most_probable_cell_type")

With the three classifiers applied here, cells can be labelled as B cells, T cells, or NK cells. The other cells (red points) are not among the cell types covered by these classifiers. Hence, they have an empty label.

For a certain cell type, users can also view the prediction probability. Here we show an example of B cell prediction probability:

# Visualize the B cell prediction probability
Seurat::FeaturePlot(seurat.obj, features = "B_cells_p")

Cells with a higher predicted probability of being B cells are shown in a darker color, while a lighter color indicates a lower, or even zero, probability. For the B cell classifier, the prediction probability threshold is currently 0.5, which means that cells with a prediction probability of 0.5 or above are predicted as B cells.

The automatic cell identification by scClassifR matches the traditional cell assignment, i.e. the approach based on the expression of canonical cell markers. As a simple example, we use CD19 and CD20 (MS4A1) to identify B cells:

# Visualize the expression of the B cell markers CD19 and MS4A1 (CD20)
Seurat::FeaturePlot(seurat.obj, features = c("CD19", "MS4A1"), ncol = 2)

We see that the marker expression of B cells exactly overlaps the B cell prediction made by scClassifR.

Session Info

sessionInfo()
#> R version 4.1.1 (2021-08-10)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] scClassifR_1.2.0            SingleCellExperiment_1.16.0
#>  [3] SummarizedExperiment_1.24.0 Biobase_2.54.0             
#>  [5] GenomicRanges_1.46.0        GenomeInfoDb_1.30.0        
#>  [7] IRanges_2.28.0              S4Vectors_0.32.0           
#>  [9] BiocGenerics_0.40.0         MatrixGenerics_1.6.0       
#> [11] matrixStats_0.61.0          SeuratObject_4.0.2         
#> [13] Seurat_4.0.5               
#> 
#> loaded via a namespace (and not attached):
#>   [1] plyr_1.8.6             igraph_1.2.7           lazyeval_0.2.2        
#>   [4] splines_4.1.1          listenv_0.8.0          scattermore_0.7       
#>   [7] ggplot2_3.3.5          digest_0.6.28          foreach_1.5.1         
#>  [10] htmltools_0.5.2        fansi_0.5.0            magrittr_2.0.1        
#>  [13] tensor_1.5             cluster_2.1.2          ROCR_1.0-11           
#>  [16] recipes_0.1.17         globals_0.14.0         gower_0.2.2           
#>  [19] spatstat.sparse_2.0-0  colorspace_2.0-2       ggrepel_0.9.1         
#>  [22] xfun_0.27              dplyr_1.0.7            crayon_1.4.1          
#>  [25] RCurl_1.98-1.5         jsonlite_1.7.2         spatstat.data_2.1-0   
#>  [28] survival_3.2-13        zoo_1.8-9              iterators_1.0.13      
#>  [31] ape_5.5                glue_1.4.2             polyclip_1.10-0       
#>  [34] gtable_0.3.0           ipred_0.9-12           zlibbioc_1.40.0       
#>  [37] XVector_0.34.0         leiden_0.3.9           DelayedArray_0.20.0   
#>  [40] kernlab_0.9-29         future.apply_1.8.1     abind_1.4-5           
#>  [43] scales_1.1.1           data.tree_1.0.0        DBI_1.1.1             
#>  [46] miniUI_0.1.1.1         Rcpp_1.0.7             viridisLite_0.4.0     
#>  [49] xtable_1.8-4           reticulate_1.22        spatstat.core_2.3-0   
#>  [52] proxy_0.4-26           lava_1.6.10            prodlim_2019.11.13    
#>  [55] htmlwidgets_1.5.4      httr_1.4.2             RColorBrewer_1.1-2    
#>  [58] ellipsis_0.3.2         ica_1.0-2              farver_2.1.0          
#>  [61] pkgconfig_2.0.3        nnet_7.3-16            sass_0.4.0            
#>  [64] uwot_0.1.10            deldir_1.0-6           utf8_1.2.2            
#>  [67] caret_6.0-90           labeling_0.4.2         tidyselect_1.1.1      
#>  [70] rlang_0.4.12           reshape2_1.4.4         later_1.3.0           
#>  [73] munsell_0.5.0          tools_4.1.1            generics_0.1.1        
#>  [76] ggridges_0.5.3         evaluate_0.14          stringr_1.4.0         
#>  [79] fastmap_1.1.0          yaml_2.2.1             goftest_1.2-3         
#>  [82] ModelMetrics_1.2.2.2   knitr_1.36             fitdistrplus_1.1-6    
#>  [85] purrr_0.3.4            RANN_2.6.1             pbapply_1.5-0         
#>  [88] future_1.22.1          nlme_3.1-153           mime_0.12             
#>  [91] compiler_4.1.1         plotly_4.10.0          png_0.1-7             
#>  [94] e1071_1.7-9            spatstat.utils_2.2-0   tibble_3.1.5          
#>  [97] bslib_0.3.1            stringi_1.7.5          highr_0.9             
#> [100] lattice_0.20-45        Matrix_1.3-4           vctrs_0.3.8           
#> [103] pillar_1.6.4           lifecycle_1.0.1        spatstat.geom_2.3-0   
#> [106] lmtest_0.9-38          jquerylib_0.1.4        RcppAnnoy_0.0.19      
#> [109] data.table_1.14.2      cowplot_1.1.1          bitops_1.0-7          
#> [112] irlba_2.3.3            httpuv_1.6.3           patchwork_1.1.1       
#> [115] R6_2.5.1               promises_1.2.0.1       KernSmooth_2.23-20    
#> [118] gridExtra_2.3          parallelly_1.28.1      codetools_0.2-18      
#> [121] MASS_7.3-54            assertthat_0.2.1       withr_2.4.2           
#> [124] sctransform_0.3.2      GenomeInfoDbData_1.2.7 mgcv_1.8-38           
#> [127] parallel_4.1.1         grid_4.1.1             rpart_4.1-15          
#> [130] timeDate_3043.102      class_7.3-19           tidyr_1.1.4           
#> [133] rmarkdown_2.11         Rtsne_0.15             pROC_1.18.0           
#> [136] lubridate_1.8.0        shiny_1.7.1