scAnnotatR is an R package for cell type prediction on single-cell RNA-sequencing data. Currently, the package supports data in the form of a Seurat object or a SingleCellExperiment object.
More information about Seurat and SingleCellExperiment objects can be found in the documentation of the respective packages.
scAnnotatR provides 2 main features:
The scAnnotatR package can be directly installed from Bioconductor:
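# install BiocManager first if it is not available yet
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
# install scAnnotatR from Bioconductor if it cannot be loaded
if (!require(scAnnotatR))
    BiocManager::install("scAnnotatR")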
For more information, see
The scAnnotatR package comes with several pre-trained models to classify cell types.
# load scAnnotatR into working space
library(scAnnotatR)
#> Loading required package: Seurat
#> Loading required package: SeuratObject
#> Loading required package: sp
#> 'SeuratObject' was built with package 'Matrix' 1.7.0 but the current
#> version is 1.7.1; it is recomended that you reinstall 'SeuratObject' as
#> the ABI for 'Matrix' may have changed
#> Attaching package: 'SeuratObject'
#> The following objects are masked from 'package:base':
#> intersect, t
#> Loading required package: SingleCellExperiment
#> Loading required package: SummarizedExperiment
#> Loading required package: MatrixGenerics
#> Loading required package: matrixStats
#> Attaching package: 'MatrixGenerics'
#> The following objects are masked from 'package:matrixStats':
#> colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
#> colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
#> colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
#> colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
#> colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
#> colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
#> colWeightedMeans, colWeightedMedians, colWeightedSds,
#> colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
#> rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
#> rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
#> rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
#> rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
#> rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
#> rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
#> rowWeightedSds, rowWeightedVars
#> Loading required package: GenomicRanges
#> Loading required package: stats4
#> Loading required package: BiocGenerics
#> Attaching package: 'BiocGenerics'
#> The following object is masked from 'package:SeuratObject':
#> intersect
#> The following objects are masked from 'package:stats':
#> IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#> Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
#> as.data.frame, basename, cbind, colnames, dirname, do.call,
#> duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
#> lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
#> pmin.int, rank, rbind, rownames, sapply, saveRDS, setdiff, table,
#> tapply, union, unique, unsplit, which.max, which.min
#> Loading required package: S4Vectors
#> Attaching package: 'S4Vectors'
#> The following object is masked from 'package:utils':
#> findMatches
#> The following objects are masked from 'package:base':
#> I, expand.grid, unname
#> Loading required package: IRanges
#> Attaching package: 'IRanges'
#> The following object is masked from 'package:sp':
#> %over%
#> Loading required package: GenomeInfoDb
#> Loading required package: Biobase
#> Welcome to Bioconductor
#> Vignettes contain introductory material; view with
#> 'browseVignettes()'. To cite Bioconductor, see
#> 'citation("Biobase")', and for packages 'citation("pkgname")'.
#> Attaching package: 'Biobase'
#> The following object is masked from 'package:MatrixGenerics':
#> rowMedians
#> The following objects are masked from 'package:matrixStats':
#> anyMissing, rowMedians
#> Attaching package: 'SummarizedExperiment'
#> The following object is masked from 'package:Seurat':
#> Assays
#> The following object is masked from 'package:SeuratObject':
#> Assays
#> Warning: replacing previous import 'ape::where' by 'dplyr::where' when loading
#> 'scAnnotatR'
The models are stored in the default_models object:
default_models <- load_models("default")
#> loading from cache
names(default_models)
#> [1] "B cells" "Plasma cells" "NK"
#> [4] "CD16 NK" "CD56 NK" "T cells"
#> [7] "CD4 T cells" "CD8 T cells" "Treg"
#> [10] "NKT" "ILC" "Monocytes"
#> [13] "CD14 Mono" "CD16 Mono" "DC"
#> [16] "pDC" "Endothelial cells" "LEC"
#> [19] "VEC" "Platelets" "RBC"
#> [22] "Melanocyte" "Schwann cells" "Pericytes"
#> [25] "Mast cells" "Keratinocytes" "alpha"
#> [28] "beta" "delta" "gamma"
#> [31] "acinar" "ductal" "Fibroblasts"
The default_models object is a named list of classifiers. Each classifier is an instance of the scAnnotatR S4 class. For example:
default_models[['B cells']]
#> An object of class scAnnotatR for B cells
#> * 31 marker genes applied: CD38, CD79B, CD74, CD84, RASGRP2, TCF3, SP140, MEF2C, DERL3, CD37, CD79A, POU2AF1, MVK, CD83, BACH2, LY86, CD86, SDC1, CR2, LRMP, VPREB3, IL2RA, BLK, IRF8, FLI1, MS4A1, CD14, MZB1, PTEN, CD19, MME
#> * Predicting probability threshold: 0.5
#> * No parent model
To identify the cell types present in a dataset, we need to load the dataset as a Seurat or SingleCellExperiment object.
For this vignette, we use a small sample dataset that is available as a Seurat object as part of the package.
# load the example dataset
data("tirosh_mel80_example")
tirosh_mel80_example
#> An object of class Seurat
#> 91 features across 480 samples within 1 assay
#> Active assay: RNA (91 features, 0 variable features)
#> 2 layers present: counts, data
#> 1 dimensional reduction calculated: umap
The example dataset already contains the clustering results as part of the metadata. This is not necessary for the classification process.
head(tirosh_mel80_example[[]])
#> orig.ident nCount_RNA nFeature_RNA percent.mt
#> Cy80_II_CD45_B07_S883_comb SeuratProject 42.46011 8 0
#> Cy80_II_CD45_C09_S897_comb SeuratProject 74.35907 14 0
#> Cy80_II_CD45_H07_S955_comb SeuratProject 42.45392 8 0
#> Cy80_II_CD45_H09_S957_comb SeuratProject 63.47043 12 0
#> Cy80_II_CD45_B11_S887_comb SeuratProject 47.26798 9 0
#> Cy80_II_CD45_D11_S911_comb SeuratProject 69.12167 13 0
#> RNA_snn_res.0.8 seurat_clusters RNA_snn_res.0.5
#> Cy80_II_CD45_B07_S883_comb 4 4 2
#> Cy80_II_CD45_C09_S897_comb 4 4 2
#> Cy80_II_CD45_H07_S955_comb 4 4 2
#> Cy80_II_CD45_H09_S957_comb 4 4 2
#> Cy80_II_CD45_B11_S887_comb 4 4 2
#> Cy80_II_CD45_D11_S911_comb 1 1 1
To launch cell type identification, we simply call the classify_cells function. A detailed description of all parameters can be found on the function’s help page, ?classify_cells.
Here we use only 3 classifiers (B cells, T cells and NK cells) to reduce the computational cost of this vignette. To apply all pretrained classifiers to a dataset, use cell_types = 'all'.
seurat.obj <- classify_cells(classify_obj = tirosh_mel80_example,
                             assay = 'RNA', slot = 'counts',
                             cell_types = c('B cells', 'NK', 'T cells'),
                             path_to_models = 'default')
#> loading from cache
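The classifiers to apply can be restricted either by naming the cell types or by passing the classifier objects directly:
cell_types = c('B cells', 'T cells')
classifiers = c(default_models[['B cells']], default_models[['T cells']])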
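Because the pretrained models are selected by name, it can help to check the requested cell types against the available model names before starting a classification run. The following is a minimal base-R sketch, assuming that names passed to cell_types have to match the model names listed earlier exactly. The helper check_cell_types is hypothetical and not part of scAnnotatR.

```r
# Names of the pretrained models, copied from the list printed earlier.
# In a live session this vector could instead be names(default_models).
available <- c(
  "B cells", "Plasma cells", "NK", "CD16 NK", "CD56 NK", "T cells",
  "CD4 T cells", "CD8 T cells", "Treg", "NKT", "ILC", "Monocytes",
  "CD14 Mono", "CD16 Mono", "DC", "pDC", "Endothelial cells", "LEC",
  "VEC", "Platelets", "RBC", "Melanocyte", "Schwann cells", "Pericytes",
  "Mast cells", "Keratinocytes", "alpha", "beta", "delta", "gamma",
  "acinar", "ductal", "Fibroblasts"
)

# Hypothetical helper (not part of scAnnotatR): warn about requested names
# without a matching model and return only the names that do match.
check_cell_types <- function(requested, available) {
  unknown <- setdiff(requested, available)
  if (length(unknown) > 0) {
    warning("No pretrained model named: ", paste(unknown, collapse = ", "))
  }
  intersect(requested, available)
}

# All three names used in this vignette are valid model names.
valid <- check_cell_types(c("B cells", "NK", "T cells"), available)

# "NK cells" is not a model name ("NK" is), so it is dropped with a warning.
partly_valid <- suppressWarnings(
  check_cell_types(c("B cells", "NK cells"), available)
)
```

The returned vector can then be passed on as the cell_types argument.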
The classify_cells function returns the input object with additional columns in the metadata table.
# display the additional metadata fields
seurat.obj[[]][c(50:60), c(8:ncol(seurat.obj[[]]))]
#> B_cells_p B_cells_class NK_p
#> cy80.Cd45.pos.PD1.pos.B09.S45.comb 0.007754246 no 0.4881285
#> cy80.Cd45.pos.Pd1.neg.S366.H06.S366.comb 0.999385770 yes 0.4440553
#> cy80.Cd45.pos.Pd1.neg.S202.A10.S202.comb 0.998317662 yes 0.4416114
#> cy80.Cd45.pos.Pd1.neg.S201.A09.S201.comb 0.997774856 yes 0.4398997
#> cy80.Cd45.pos.Pd1.neg.S221.B05.S221.comb 0.998874031 yes 0.4541005
#> cy80.Cd45.pos.PD1.pos.A03.S15.comb 0.999944282 yes 0.4511450
#> cy80.Cd45.pos.PD1.pos.B11.S47.comb 0.015978230 no 0.4841041
#> cy80.Cd45.pos.PD1.pos.S189.H09.S189.comb 0.099311534 no 0.4858084
#> cy80.Cd45.pos.PD1.pos.A05.S17.comb 0.055754074 no 0.4924746
#> cy80.Cd45.pos.PD1.pos.C02.S62.comb 0.048558881 no 0.5002238
#> cy80.Cd45.pos.PD1.pos.D12.S96.comb 0.996979702 yes 0.4994867
#> NK_class T_cells_p T_cells_class
#> cy80.Cd45.pos.PD1.pos.B09.S45.comb no 0.94205232 yes
#> cy80.Cd45.pos.Pd1.neg.S366.H06.S366.comb no 0.11269306 no
#> cy80.Cd45.pos.Pd1.neg.S202.A10.S202.comb no 0.09834696 no
#> cy80.Cd45.pos.Pd1.neg.S201.A09.S201.comb no 0.22256938 no
#> cy80.Cd45.pos.Pd1.neg.S221.B05.S221.comb no 0.12903487 no
#> cy80.Cd45.pos.PD1.pos.A03.S15.comb no 0.27242536 no
#> cy80.Cd45.pos.PD1.pos.B11.S47.comb no 0.94929624 yes
#> cy80.Cd45.pos.PD1.pos.S189.H09.S189.comb no 0.93390248 yes
#> cy80.Cd45.pos.PD1.pos.A05.S17.comb no 0.98161289 yes
#> cy80.Cd45.pos.PD1.pos.C02.S62.comb yes 0.96436674 yes
#> cy80.Cd45.pos.PD1.pos.D12.S96.comb no 0.94848597 yes
#> predicted_cell_type
#> cy80.Cd45.pos.PD1.pos.B09.S45.comb T cells
#> cy80.Cd45.pos.Pd1.neg.S366.H06.S366.comb B cells
#> cy80.Cd45.pos.Pd1.neg.S202.A10.S202.comb B cells
#> cy80.Cd45.pos.Pd1.neg.S201.A09.S201.comb B cells
#> cy80.Cd45.pos.Pd1.neg.S221.B05.S221.comb B cells
#> cy80.Cd45.pos.PD1.pos.A03.S15.comb B cells
#> cy80.Cd45.pos.PD1.pos.B11.S47.comb T cells
#> cy80.Cd45.pos.PD1.pos.S189.H09.S189.comb T cells
#> cy80.Cd45.pos.PD1.pos.A05.S17.comb T cells
#> cy80.Cd45.pos.PD1.pos.C02.S62.comb NK/T cells
#> cy80.Cd45.pos.PD1.pos.D12.S96.comb B cells/T cells
#> most_probable_cell_type
#> cy80.Cd45.pos.PD1.pos.B09.S45.comb T cells
#> cy80.Cd45.pos.Pd1.neg.S366.H06.S366.comb B cells
#> cy80.Cd45.pos.Pd1.neg.S202.A10.S202.comb B cells
#> cy80.Cd45.pos.Pd1.neg.S201.A09.S201.comb B cells
#> cy80.Cd45.pos.Pd1.neg.S221.B05.S221.comb B cells
#> cy80.Cd45.pos.PD1.pos.A03.S15.comb B cells
#> cy80.Cd45.pos.PD1.pos.B11.S47.comb T cells
#> cy80.Cd45.pos.PD1.pos.S189.H09.S189.comb T cells
#> cy80.Cd45.pos.PD1.pos.A05.S17.comb T cells
#> cy80.Cd45.pos.PD1.pos.C02.S62.comb T cells
#> cy80.Cd45.pos.PD1.pos.D12.S96.comb B cells
New columns are:
predicted_cell_type: The predicted cell type, also containing any ambiguous assignments. In these cases, the possible cell types are separated by a “/”
most_probable_cell_type: The most probable cell type, ignoring any ambiguous assignments.
columns with syntax [celltype]_p: The probability that a cell belongs to the given cell type. Unknown cell types are marked as NAs.
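The relationship between these columns can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: it takes the probabilities of three cells from the output above, assumes the 0.5 threshold reported for the B cell classifier applies to all three classifiers, and assumes cells without any positive call receive NA. Hierarchical (parent and child) classifiers are ignored.

```r
# Probabilities copied from three cells of the output above
# (columns B_cells_p, NK_p and T_cells_p).
probs <- rbind(
  "cy80.Cd45.pos.PD1.pos.B09.S45.comb" = c(0.007754246, 0.4881285, 0.94205232),
  "cy80.Cd45.pos.PD1.pos.C02.S62.comb" = c(0.048558881, 0.5002238, 0.96436674),
  "cy80.Cd45.pos.PD1.pos.D12.S96.comb" = c(0.996979702, 0.4994867, 0.94848597)
)
colnames(probs) <- c("B cells", "NK", "T cells")

# Assumption: the same threshold is used for all three classifiers.
threshold <- 0.5

# yes/no call per classifier: probability at or above the threshold
is_positive <- probs >= threshold

# predicted_cell_type: every positive cell type, joined by "/"
# (NA when no classifier passes the threshold, which is an assumption)
predicted <- apply(is_positive, 1, function(hit) {
  if (any(hit)) paste(colnames(probs)[hit], collapse = "/") else NA_character_
})

# most_probable_cell_type: the positive cell type with the highest probability
most_probable <- vapply(seq_len(nrow(probs)), function(i) {
  cell_probs <- probs[i, ]          # named vector for one cell
  hit <- is_positive[i, ]
  if (!any(hit)) return(NA_character_)
  names(which.max(cell_probs[hit]))
}, character(1))

annotation <- data.frame(
  predicted_cell_type = predicted,
  most_probable_cell_type = most_probable
)
annotation
```

For these three cells the sketch gives the same labels as the classify_cells output shown above, including the ambiguous "NK/T cells" and "B cells/T cells" assignments.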
The predicted cell types can now be visualized using the matching plotting functions. In this example, we use Seurat’s DimPlot function:
# Visualize the cell types
Seurat::DimPlot(seurat.obj, group.by = "most_probable_cell_type")
With the current number of cell classifiers, we identify cells belonging to 2 cell types (B cells and T cells) and to 2 subtypes of T cells (CD4+ T cells and CD8+ T cells). The other cells (red points) are not among the cell types that can be classified by the predefined classifiers. Hence, they have an empty label.
For a given cell type, users can also view the prediction probability. Here we show an example of the B cell prediction probability:
# Visualize the B cell prediction probability
Seurat::FeaturePlot(seurat.obj, features = "B_cells_p")
Cells predicted to be B cells with higher probability are shown in a darker color, while lighter colors indicate a lower or even zero probability. For the B cell classifier, the prediction probability threshold is currently 0.5, which means that cells with a prediction probability of 0.5 or above are predicted as B cells.
The automatic cell identification by scAnnotatR matches the traditional cell assignment, i.e., the approach based on canonical marker expression. As a simple example, we use CD19 and CD20 (MS4A1) to identify B cells:
# Visualize the expression of canonical B cell markers
Seurat::FeaturePlot(seurat.obj, features = c("CD19", "MS4A1"), ncol = 2)
We see that the marker expression of B cells exactly overlaps the B cell prediction made by scAnnotatR.
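Beyond comparing the plots by eye, the agreement between marker expression and prediction could also be tabulated. The sketch below is a hypothetical helper written in base R, not a function of scAnnotatR or Seurat, and the six input values are made-up toy numbers used only to show the shape of the result.

```r
# Hypothetical helper (not part of scAnnotatR or Seurat): cross-tabulate
# whether a marker is expressed against the classifier's yes/no call.
marker_agreement <- function(marker_expr, predicted_class) {
  stopifnot(length(marker_expr) == length(predicted_class))
  marker_status <- ifelse(marker_expr > 0, "expressed", "not expressed")
  table(marker = marker_status, prediction = predicted_class)
}

# Made-up toy values for six cells, for illustration only.
toy_ms4a1 <- c(2.1, 0, 1.4, 0, 0, 3.0)
toy_class <- c("yes", "no", "yes", "no", "no", "yes")

agreement <- marker_agreement(toy_ms4a1, toy_class)
agreement

# With the classified object, the inputs could be pulled with Seurat's
# FetchData, for example:
# df <- Seurat::FetchData(seurat.obj, vars = c("MS4A1", "B_cells_class"))
# marker_agreement(df$MS4A1, df$B_cells_class)
```

Cells on the diagonal of the table (marker expressed and predicted "yes", or marker not expressed and predicted "no") are those where both approaches agree.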
sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.20-bioc/R/lib/
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/
#> locale:
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#> other attached packages:
#> [1] scAnnotatR_1.12.0 SingleCellExperiment_1.28.0
#> [3] SummarizedExperiment_1.36.0 Biobase_2.66.0
#> [5] GenomicRanges_1.58.0 GenomeInfoDb_1.42.0
#> [7] IRanges_2.40.0 S4Vectors_0.44.0
#> [9] BiocGenerics_0.52.0 MatrixGenerics_1.18.0
#> [11] matrixStats_1.4.1 Seurat_5.1.0
#> [13] SeuratObject_5.0.2 sp_2.1-4
#> loaded via a namespace (and not attached):
#> [1] RcppAnnoy_0.0.22 splines_4.4.1 later_1.3.2
#> [4] filelock_1.0.3 tibble_3.2.1 polyclip_1.10-7
#> [7] hardhat_1.4.0 pROC_1.18.5 rpart_4.1.23
#> [10] fastDummies_1.7.4 lifecycle_1.0.4 globals_0.16.3
#> [13] lattice_0.22-6 MASS_7.3-61 magrittr_2.0.3
#> [16] plotly_4.10.4 sass_0.4.9 rmarkdown_2.28
#> [19] jquerylib_0.1.4 yaml_2.3.10 httpuv_1.6.15
#> [22] sctransform_0.4.1 spam_2.11-0 spatstat.sparse_3.1-0
#> [25] reticulate_1.39.0 cowplot_1.1.3 pbapply_1.7-2
#> [28] DBI_1.2.3 RColorBrewer_1.1-3 lubridate_1.9.3
#> [31] abind_1.4-8 zlibbioc_1.52.0 Rtsne_0.17
#> [34] purrr_1.0.2 nnet_7.3-19 rappdirs_0.3.3
#> [37] ipred_0.9-15 lava_1.8.0 GenomeInfoDbData_1.2.13
#> [40] data.tree_1.1.0 ggrepel_0.9.6 irlba_2.3.5.1
#> [43] listenv_0.9.1 spatstat.utils_3.1-0 goftest_1.2-3
#> [46] RSpectra_0.16-2 spatstat.random_3.3-2 fitdistrplus_1.2-1
#> [49] parallelly_1.38.0 leiden_0.4.3.1 codetools_0.2-20
#> [52] DelayedArray_0.32.0 tidyselect_1.2.1 UCSC.utils_1.2.0
#> [55] farver_2.1.2 BiocFileCache_2.14.0 spatstat.explore_3.3-3
#> [58] jsonlite_1.8.9 caret_6.0-94 e1071_1.7-16
#> [61] progressr_0.15.0 ggridges_0.5.6 survival_3.7-0
#> [64] iterators_1.0.14 foreach_1.5.2 tools_4.4.1
#> [67] ica_1.0-3 Rcpp_1.0.13 glue_1.8.0
#> [70] prodlim_2024.06.25 gridExtra_2.3 SparseArray_1.6.0
#> [73] xfun_0.48 dplyr_1.1.4 withr_3.0.2
#> [76] BiocManager_1.30.25 fastmap_1.2.0 fansi_1.0.6
#> [79] digest_0.6.37 timechange_0.3.0 R6_2.5.1
#> [82] mime_0.12 colorspace_2.1-1 scattermore_1.2
#> [85] tensor_1.5 spatstat.data_3.1-2 RSQLite_2.3.7
#> [88] utf8_1.2.4 tidyr_1.3.1 generics_0.1.3
#> [91] data.table_1.16.2 recipes_1.1.0 class_7.3-22
#> [94] httr_1.4.7 htmlwidgets_1.6.4 S4Arrays_1.6.0
#> [97] ModelMetrics_1.2.2.2 uwot_0.2.2 pkgconfig_2.0.3
#> [100] gtable_0.3.6 timeDate_4041.110 blob_1.2.4
#> [103] lmtest_0.9-40 XVector_0.46.0 htmltools_0.5.8.1
#> [106] dotCall64_1.2 scales_1.3.0 png_0.1-8
#> [109] gower_1.0.1 spatstat.univar_3.0-1 knitr_1.48
#> [112] reshape2_1.4.4 nlme_3.1-166 curl_5.2.3
#> [115] proxy_0.4-27 cachem_1.1.0 zoo_1.8-12
#> [118] stringr_1.5.1 BiocVersion_3.20.0 KernSmooth_2.23-24
#> [121] parallel_4.4.1 miniUI_0.1.1.1 AnnotationDbi_1.68.0
#> [124] pillar_1.9.0 grid_4.4.1 vctrs_0.6.5
#> [127] RANN_2.6.2 promises_1.3.0 dbplyr_2.5.0
#> [130] xtable_1.8-4 cluster_2.1.6 evaluate_1.0.1
#> [133] cli_3.6.3 compiler_4.4.1 rlang_1.1.4
#> [136] crayon_1.5.3 future.apply_1.11.3 labeling_0.4.3
#> [139] plyr_1.8.9 stringi_1.8.4 viridisLite_0.4.2
#> [142] deldir_2.0-4 munsell_0.5.1 Biostrings_2.74.0
#> [145] lazyeval_0.2.2 spatstat.geom_3.3-3 Matrix_1.7-1
#> [148] RcppHNSW_0.6.0 patchwork_1.3.0 bit64_4.5.2
#> [151] future_1.34.0 ggplot2_3.5.1 KEGGREST_1.46.0
#> [154] shiny_1.9.1 highr_0.11 AnnotationHub_3.14.0
#> [157] kernlab_0.9-33 ROCR_1.0-11 igraph_2.1.1
#> [160] memoise_2.0.1 bslib_0.8.0 bit_4.5.0
#> [163] ape_5.8