library(SingleCellExperiment)
library(splatter)
library(scater)
library(cluster)
library(scone)
PsiNorm is a scalable between-sample normalization for single-cell RNA-seq count data based on the power-law Pareto type I distribution. It can be shown that the Pareto shape parameter is inversely proportional to the sequencing depth; it is sample specific, and its estimate can be obtained for each cell independently. PsiNorm computes the shape parameter for each cellular sample and then uses it as a multiplicative size factor to normalize the data. The final goal of the transformation is to align the gene expression distributions across cells, especially for highly expressed genes. Note that, similar to other global scaling methods, our method does not remove batch effects, which can be dealt with using downstream tools.
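To make the idea concrete, the following base-R sketch computes a maximum-likelihood Pareto type I shape estimate for each cell of a toy matrix and uses it as a multiplicative factor. It is an illustration only: the toy counts, the function name `pareto_shape`, and the shift of the counts by 1 (so that every value is at least 1) are assumptions made here, and the packaged `PsiNorm` function should be used for real analyses.

```r
# Toy count matrix: 5 genes (rows) x 3 cells (columns); values are made up.
counts <- matrix(c(0, 3, 10, 1, 0,
                   0, 6, 25, 2, 1,
                   1, 1,  4, 0, 0),
                 nrow = 5,
                 dimnames = list(paste0("g", 1:5), paste0("c", 1:3)))

# Maximum-likelihood estimate of the Pareto type I shape parameter,
# computed independently for each cell (column).
pareto_shape <- function(x) {
  x <- x + 1                                  # assumed shift: all values >= 1
  x_min <- apply(x, 2, min)                   # per-cell scale (minimum) estimate
  log_ratio <- sweep(log(x), 2, log(x_min))   # log(x / x_min), column by column
  nrow(x) / colSums(log_ratio)
}

alpha <- pareto_shape(counts)

# Use the shape parameter as a multiplicative, cell-specific factor.
normalized <- sweep(counts, 2, alpha, "*")
```

In this toy example the second cell has the largest counts and receives the smallest factor, in line with the inverse relationship between the shape parameter and sequencing depth described above.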
To evaluate the ability of PsiNorm to remove technical bias and reveal the true cell similarity structure, we use both an unsupervised and a supervised approach. We first simulate a scRNA-seq experiment with four known clusters using the splatter Bioconductor package. Then, in the unsupervised approach, we i) reduce dimensionality using PCA, ii) identify clusters using the clara partitioning method, and iii) compute the Adjusted Rand Index (ARI) to compare the known and the estimated partitions.
In the supervised approach, we i) reduce dimensionality using PCA, and ii) compute the silhouette index of the known partition in the reduced-dimensional space.
If you use PsiNorm in publications, please cite the following article:
Borella, M., Martello, G., Risso, D., & Romualdi, C. (2021). PsiNorm: a scalable normalization for single-cell RNA-seq data. bioRxiv. https://doi.org/10.1101/2021.04.07.438822.
We simulate a matrix of counts with 2000 cellular samples and 10000 genes with splatter.
set.seed(1234)
params <- newSplatParams()
N=2000
sce <- splatSimulateGroups(params, batchCells=N, lib.loc=12,
group.prob = rep(0.25,4),
de.prob = 0.2, de.facLoc = 0.06,
verbose = FALSE)
sce is a SingleCellExperiment object with a single batch and four different cellular groups.
To visualize the data, we use the first two principal components estimated from the log-transformed raw count matrix.
set.seed(1234)
assay(sce, "lograwcounts") <- log1p(counts(sce))
sce <- runPCA(sce, exprs_values="lograwcounts", scale=TRUE, ncomponents = 2)
plotPCA(sce, colour_by="Group")
We then normalize the raw counts with PsiNorm and again visualize the data using the first two principal components.
sce<-PsiNorm(sce)
sce<-logNormCounts(sce)
head(sizeFactors(sce))
#> [1] 1.0422137 0.9635032 1.0897147 1.0121574 0.9670038 1.0882353
Note that running the PsiNorm function computes a set of size factors that are added to the SingleCellExperiment object.
The logNormCounts function can then be used to normalize the data by dividing the raw counts by the size factors and log-transforming the result.
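As a rough illustration of what this step does, the base-R sketch below scales a toy count matrix by per-cell size factors and applies a log2 transformation with a pseudocount of 1, which mirrors the default behaviour of `logNormCounts` (size factors centred to a mean of 1, counts divided by them). The counts and the size factor values are hypothetical and chosen only to keep the arithmetic easy to follow.

```r
# Toy counts: 2 genes (rows) x 3 cells (columns); values are made up.
counts <- matrix(c(0, 4,
                   2, 8,
                   1, 3), nrow = 2)

# Hypothetical per-cell size factors, centred so that their mean is 1.
size_factors <- c(0.5, 1.0, 1.5)
size_factors <- size_factors / mean(size_factors)

# Divide each column (cell) by its size factor, then log2-transform
# with a pseudocount of 1.
normalized <- sweep(counts, 2, size_factors, "/")
log_normalized <- log2(normalized + 1)
```

A cell with a small size factor has its counts scaled up, and a cell with a large size factor has them scaled down, before the log transformation is applied.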
set.seed(1234)
sce<-runPCA(sce, exprs_values="logcounts", scale=TRUE, name = "PsiNorm_PCA",
ncomponents = 2)
plotReducedDim(sce, dimred = "PsiNorm_PCA", colour_by = "Group")
We can appreciate from the plot that PsiNorm allows a better separation among known cellular groups.
We calculate the ARI for both raw counts and PsiNorm normalized counts after PCA dimension reduction and clara clustering (with k equal to the simulated number of clusters); the higher the ARI, the better the normalization.
groups<-cluster::clara(reducedDim(sce, "PCA"), k=nlevels(sce$Group))
a<-paste("ARI from raw counts:",
round(mclust::adjustedRandIndex(groups$clustering, sce$Group),
digits = 3))
groups<-cluster::clara(reducedDim(sce, "PsiNorm_PCA"), k=nlevels(sce$Group))
b<-paste("ARI from PsiNorm normalized data:",
round(mclust::adjustedRandIndex(groups$clustering, sce$Group),
digits = 3))
kableExtra::kable(rbind(a,b), row.names = FALSE)
ARI from raw counts: 0.347
ARI from PsiNorm normalized data: 0.959
Pareto normalization considerably increases the ARI.
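For readers unfamiliar with the ARI, the following base-R sketch computes it from the contingency table of two partitions. It is meant only to show what the index measures; the function name `adjusted_rand_index` and the toy label vectors are assumptions introduced here, and the analysis above relies on `mclust::adjustedRandIndex`.

```r
# Adjusted Rand Index from the contingency table of two partitions.
adjusted_rand_index <- function(x, y) {
  tab <- table(x, y)
  n_pairs <- function(n) n * (n - 1) / 2        # number of pairs among n items
  sum_cells <- sum(n_pairs(tab))                # pairs grouped together in both
  sum_rows <- sum(n_pairs(rowSums(tab)))        # pairs grouped together in x
  sum_cols <- sum(n_pairs(colSums(tab)))        # pairs grouped together in y
  expected <- sum_rows * sum_cols / n_pairs(length(x))
  (sum_cells - expected) / ((sum_rows + sum_cols) / 2 - expected)
}

truth      <- c(1, 1, 1, 2, 2, 2)
relabelled <- c("b", "b", "b", "a", "a", "a")   # same partition, other labels
one_error  <- c(1, 1, 2, 2, 2, 2)               # one cell assigned wrongly

ari_same  <- adjusted_rand_index(truth, relabelled)
ari_error <- adjusted_rand_index(truth, one_error)
```

The index does not depend on how clusters are labelled, so `ari_same` equals 1, while a single misassigned cell already lowers `ari_error` below 1.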
We calculate the Silhouette index for both raw counts and PsiNorm normalized counts after PCA dimension reduction, using the known simulated clusters; the higher the Silhouette, the better the normalization.
dist<-daisy(reducedDim(sce, "PCA"))
dist<-as.matrix(dist)
a<-paste("Silhouette from raw counts:", round(summary(
silhouette(x=as.numeric(as.factor(sce$Group)),
dmatrix = dist))$avg.width, digits = 3))
dist<-daisy(reducedDim(sce, "PsiNorm_PCA"))
dist<-as.matrix(dist)
b<-paste("Silhouette from PsiNorm normalized data:", round(summary(
silhouette(x=as.numeric(as.factor(sce$Group)),
dmatrix = dist))$avg.width, digits = 3))
kableExtra::kable(rbind(a,b), row.names = FALSE)
Silhouette from raw counts: 0.128
Silhouette from PsiNorm normalized data: 0.686
Pareto normalization considerably increases the Silhouette index.
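The quantity reported above is the average silhouette width of the known partition. The base-R sketch below shows how it is defined for a distance matrix and a label vector; the function name `avg_silhouette` and the toy points are assumptions made for illustration, and the sketch assumes that every cluster contains at least two points.

```r
# Average silhouette width of a given partition.
# For each point: a = mean distance to the other points of its own cluster,
# b = smallest mean distance to the points of any other cluster,
# and the silhouette width is (b - a) / max(a, b).
avg_silhouette <- function(d, labels) {
  d <- as.matrix(d)
  widths <- vapply(seq_along(labels), function(i) {
    own <- labels == labels[i]
    own[i] <- FALSE                               # exclude the point itself
    a <- mean(d[i, own])
    others <- setdiff(unique(labels), labels[i])
    b <- min(vapply(others, function(k) mean(d[i, labels == k]), numeric(1)))
    (b - a) / max(a, b)
  }, numeric(1))
  mean(widths)
}

# Two well-separated toy groups of three points each.
pts <- rbind(c(0, 0), c(0, 1), c(1, 0),
             c(10, 10), c(10, 11), c(11, 10))
good_labels <- c(1, 1, 1, 2, 2, 2)
bad_labels  <- c(1, 2, 1, 2, 1, 2)

s_good <- avg_silhouette(dist(pts), good_labels)
s_bad  <- avg_silhouette(dist(pts), bad_labels)
```

Labels that match the geometry of the points give a value close to 1, whereas labels that mix the two groups give a much lower value.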
To check whether PsiNorm is able to capture technical noise and remove unwanted variation within a dataset (due, for instance, to differences in sequencing depth), we check whether the first two PCs capture technical variance. We compute the maximum correlation between the first two PCs (PC1 and PC2) and the cell sequencing depths; a higher correlation indicates that the normalization was not able to properly remove the noise.
set.seed(4444)
PCA<-reducedDim(sce, "PCA")
PCAp<-reducedDim(sce, "PsiNorm_PCA")
depth<-apply(counts(sce), 2, sum)
a<-paste("The Correlation with the raw data is:",
round(abs(max(cor(PCA[,1], depth), cor(PCA[,2], depth))), digits=3))
b<-paste("The Correlation with the PsiNorm normalized data is:",
round(abs(max(cor(PCAp[,1], depth), cor(PCAp[,2], depth))), digits = 3))
kableExtra::kable(rbind(a,b), row.names = FALSE)
The Correlation with the raw data is: 0.926
The Correlation with the PsiNorm normalized data is: 0.014
The correlation with sequencing depth drops from 0.926 to 0.014 after PsiNorm normalization.
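Because the sign of a principal component is arbitrary, a strongly negative correlation with depth is as informative as a strongly positive one. One possible sign-robust way to summarise the check is therefore to take absolute values before the maximum. The sketch below contrasts the two orderings on toy vectors; all values and variable names are hypothetical and unrelated to the simulated dataset.

```r
# Toy sequencing depths and two toy "principal components".
depth <- c(100, 200, 300, 400, 500)
pc1 <- c(5, 4, 3, 2, 1)      # perfectly anti-correlated with depth
pc2 <- c(1, -1, 0, -1, 1)    # uncorrelated with depth

cors <- c(cor(pc1, depth), cor(pc2, depth))

# Absolute value taken first: picks up the strong negative correlation.
max_abs_cor <- max(abs(cors))

# Maximum taken first: the negative correlation is discarded.
abs_of_max <- abs(max(cors))
```

Here `max_abs_cor` is 1 while `abs_of_max` is essentially 0, so the order of `abs()` and `max()` matters whenever one of the correlations is negative.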
scone()
As for other normalizations, scone includes a wrapper function to use PsiNorm in the SCONE evaluation framework.
See Section 3.2 of the “Introduction to SCONE” vignette for an example on how to use PsiNorm within the main scone() function.
The PsiNorm normalization method can be used as a replacement for Seurat’s default normalization methods. To do so, we need to first normalize the data stored in a SingleCellExperiment object and then coerce that object to a Seurat object. This can be done with the as.Seurat function provided in the Seurat package (tested with Seurat 4.0.3).
library(Seurat)
sce <- PsiNorm(sce)
sce <- logNormCounts(sce)
seu <- as.Seurat(sce)
From this point on, one can continue the analysis with the recommended Seurat workflow, but using PsiNorm log-normalized data.
Thanks to the HDF5Array and DelayedArray packages, PsiNorm can be applied directly to HDF5-backed matrices without the need for the user to change the code. As an example, we use a dataset from the TENxPBMCData package, which provides several SingleCellExperiment objects with HDF5-backed matrices as their assays.
library(TENxPBMCData)
sce <- TENxPBMCData("pbmc4k")
sce
#> class: SingleCellExperiment
#> dim: 33694 4340
#> metadata(0):
#> assays(1): counts
#> rownames(33694): ENSG00000243485 ENSG00000237613 ... ENSG00000277475
#> ENSG00000268674
#> rowData names(3): ENSEMBL_ID Symbol_TENx Symbol
#> colnames: NULL
#> colData names(11): Sample Barcode ... Individual Date_published
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
In particular, we use the pbmc4k dataset that contains about 4,000 PBMCs from a healthy donor.
The counts assay of this object is a DelayedMatrix backed by an HDF5 file. Hence, the data are stored on disk (out of memory).
counts(sce)
#> <33694 x 4340> sparse matrix of class DelayedMatrix and type "integer":
#> [,1] [,2] [,3] [,4] ... [,4337] [,4338] [,4339]
#> ENSG00000243485 0 0 0 0 . 0 0 0
#> ENSG00000237613 0 0 0 0 . 0 0 0
#> ENSG00000186092 0 0 0 0 . 0 0 0
#> ENSG00000238009 0 0 0 0 . 0 0 0
#> ENSG00000239945 0 0 0 0 . 0 0 0
#> ... . . . . . . . .
#> ENSG00000277856 0 0 0 0 . 0 0 0
#> ENSG00000275063 0 0 0 0 . 0 0 0
#> ENSG00000271254 0 0 0 0 . 0 0 0
#> ENSG00000277475 0 0 0 0 . 0 0 0
#> ENSG00000268674 0 0 0 0 . 0 0 0
#> [,4340]
#> ENSG00000243485 0
#> ENSG00000237613 0
#> ENSG00000186092 0
#> ENSG00000238009 0
#> ENSG00000239945 0
#> ... .
#> ENSG00000277856 0
#> ENSG00000275063 0
#> ENSG00000271254 0
#> ENSG00000277475 0
#> ENSG00000268674 0
seed(counts(sce))
#> An object of class "HDF5ArraySeed"
#> Slot "filepath":
#> [1] "/home/biocbuild/.cache/R/ExperimentHub/25ef00280f4638_1611"
#>
#> Slot "name":
#> [1] "/counts"
#>
#> Slot "as_sparse":
#> [1] TRUE
#>
#> Slot "type":
#> [1] NA
#>
#> Slot "dim":
#> [1] 33694 4340
#>
#> Slot "chunkdim":
#> [1] 512 66
#>
#> Slot "first_val":
#> [1] 0
Thanks to the DelayedArray framework, we can apply PsiNorm using the same code that we have used in the case of in-memory data.
sce<-PsiNorm(sce)
sce<-logNormCounts(sce)
sce
#> class: SingleCellExperiment
#> dim: 33694 4340
#> metadata(0):
#> assays(2): counts logcounts
#> rownames(33694): ENSG00000243485 ENSG00000237613 ... ENSG00000277475
#> ENSG00000268674
#> rowData names(3): ENSEMBL_ID Symbol_TENx Symbol
#> colnames: NULL
#> colData names(12): Sample Barcode ... Date_published sizeFactor
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
Note that logNormCounts is a delayed operation, meaning that the actual log-normalized values will be computed only when needed by the user. In other words, the data are still stored out-of-memory as the original count matrix and the log-normalized data will be computed only when logcounts(sce) is realized into memory.
seed(logcounts(sce))
#> An object of class "HDF5ArraySeed"
#> Slot "filepath":
#> [1] "/home/biocbuild/.cache/R/ExperimentHub/25ef00280f4638_1611"
#>
#> Slot "name":
#> [1] "/counts"
#>
#> Slot "as_sparse":
#> [1] TRUE
#>
#> Slot "type":
#> [1] NA
#>
#> Slot "dim":
#> [1] 33694 4340
#>
#> Slot "chunkdim":
#> [1] 512 66
#>
#> Slot "first_val":
#> [1] 0
sessionInfo()
#> R version 4.1.1 (2021-08-10)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] TENxPBMCData_1.11.1 HDF5Array_1.22.0
#> [3] rhdf5_2.38.0 DelayedArray_0.20.0
#> [5] Matrix_1.3-4 scone_1.18.0
#> [7] cluster_2.1.2 scater_1.22.0
#> [9] ggplot2_3.3.5 scuttle_1.4.0
#> [11] splatter_1.18.0 SingleCellExperiment_1.16.0
#> [13] SummarizedExperiment_1.24.0 Biobase_2.54.0
#> [15] GenomicRanges_1.46.0 GenomeInfoDb_1.30.0
#> [17] IRanges_2.28.0 S4Vectors_0.32.0
#> [19] BiocGenerics_0.40.0 MatrixGenerics_1.6.0
#> [21] matrixStats_0.61.0 BiocStyle_2.22.0
#>
#> loaded via a namespace (and not attached):
#> [1] utf8_1.2.2 R.utils_2.11.0
#> [3] tidyselect_1.1.1 RSQLite_2.2.8
#> [5] AnnotationDbi_1.56.0 grid_4.1.1
#> [7] BiocParallel_1.28.0 munsell_0.5.0
#> [9] ScaledMatrix_1.2.0 withr_2.4.2
#> [11] colorspace_2.0-2 filelock_1.0.2
#> [13] highr_0.9 knitr_1.36
#> [15] rstudioapi_0.13 robustbase_0.93-9
#> [17] bayesm_3.1-4 labeling_0.4.2
#> [19] GenomeInfoDbData_1.2.7 hwriter_1.3.2
#> [21] bit64_4.0.5 farver_2.1.0
#> [23] vctrs_0.3.8 generics_0.1.1
#> [25] xfun_0.27 BiocFileCache_2.2.0
#> [27] EDASeq_2.28.0 diptest_0.76-0
#> [29] R6_2.5.1 ggbeeswarm_0.6.0
#> [31] rsvd_1.0.5 locfit_1.5-9.4
#> [33] flexmix_2.3-17 bitops_1.0-7
#> [35] rhdf5filters_1.6.0 cachem_1.0.6
#> [37] assertthat_0.2.1 promises_1.2.0.1
#> [39] BiocIO_1.4.0 scales_1.1.1
#> [41] nnet_7.3-16 beeswarm_0.4.0
#> [43] gtable_0.3.0 beachmat_2.10.0
#> [45] RUVSeq_1.28.0 rlang_0.4.12
#> [47] systemfonts_1.0.3 splines_4.1.1
#> [49] rtracklayer_1.54.0 hexbin_1.28.2
#> [51] checkmate_2.0.0 BiocManager_1.30.16
#> [53] yaml_2.2.1 GenomicFeatures_1.46.0
#> [55] backports_1.2.1 httpuv_1.6.3
#> [57] tensorA_0.36.2 tools_4.1.1
#> [59] bookdown_0.24 ellipsis_0.3.2
#> [61] gplots_3.1.1 kableExtra_1.3.4
#> [63] jquerylib_0.1.4 RColorBrewer_1.1-2
#> [65] Rcpp_1.0.7 sparseMatrixStats_1.6.0
#> [67] progress_1.2.2 zlibbioc_1.40.0
#> [69] purrr_0.3.4 RCurl_1.98-1.5
#> [71] prettyunits_1.1.1 viridis_0.6.2
#> [73] cowplot_1.1.1 ggrepel_0.9.1
#> [75] magrittr_2.0.1 RSpectra_0.16-0
#> [77] magick_2.7.3 aroma.light_3.24.0
#> [79] xtable_1.8-4 mime_0.12
#> [81] hms_1.1.1 evaluate_0.14
#> [83] XML_3.99-0.8 jpeg_0.1-9
#> [85] mclust_5.4.7 gridExtra_2.3
#> [87] compiler_4.1.1 biomaRt_2.50.0
#> [89] tibble_3.1.5 KernSmooth_2.23-20
#> [91] crayon_1.4.1 R.oo_1.24.0
#> [93] htmltools_0.5.2 later_1.3.0
#> [95] segmented_1.3-4 DBI_1.1.1
#> [97] ExperimentHub_2.2.0 dbplyr_2.1.1
#> [99] MASS_7.3-54 fpc_2.2-9
#> [101] rappdirs_0.3.3 boot_1.3-28
#> [103] compositions_2.0-2 ShortRead_1.52.0
#> [105] R.methodsS3_1.8.1 parallel_4.1.1
#> [107] pkgconfig_2.0.3 GenomicAlignments_1.30.0
#> [109] xml2_1.3.2 svglite_2.0.0
#> [111] rARPACK_0.11-0 vipor_0.4.5
#> [113] bslib_0.3.1 webshot_0.5.2
#> [115] XVector_0.34.0 rvest_1.0.2
#> [117] stringr_1.4.0 digest_0.6.28
#> [119] Biostrings_2.62.0 rmarkdown_2.11
#> [121] edgeR_3.36.0 DelayedMatrixStats_1.16.0
#> [123] restfulr_0.0.13 curl_4.3.2
#> [125] kernlab_0.9-29 shiny_1.7.1
#> [127] Rsamtools_2.10.0 gtools_3.9.2
#> [129] modeltools_0.2-23 rjson_0.2.20
#> [131] lifecycle_1.0.1 jsonlite_1.7.2
#> [133] Rhdf5lib_1.16.0 BiocNeighbors_1.12.0
#> [135] viridisLite_0.4.0 limma_3.50.0
#> [137] fansi_0.5.0 pillar_1.6.4
#> [139] lattice_0.20-45 KEGGREST_1.34.0
#> [141] fastmap_1.1.0 httr_1.4.2
#> [143] DEoptimR_1.0-9 survival_3.2-13
#> [145] interactiveDisplayBase_1.32.0 glue_1.4.2
#> [147] png_0.1-7 prabclus_2.3-2
#> [149] BiocVersion_3.14.0 bit_4.0.4
#> [151] class_7.3-19 stringi_1.7.5
#> [153] sass_0.4.0 mixtools_1.2.0
#> [155] blob_1.2.2 AnnotationHub_3.2.0
#> [157] BiocSingular_1.10.0 latticeExtra_0.6-29
#> [159] caTools_1.18.2 memoise_2.0.0
#> [161] dplyr_1.0.7 irlba_2.3.3