SingleCellExperiment 1.18.1
By design, the scope of this package is limited to defining the SingleCellExperiment class and some minimal getter and setter methods.
For this reason, we leave it to developers of specialized packages to provide more advanced methods for the SingleCellExperiment class.
If packages define their own data structure, it is their responsibility to provide coercion methods between their classes and SingleCellExperiment.
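As a minimal sketch of such coercion methods, assume a hypothetical class MyCells (not part of any real package) that stores nothing but a count matrix. Conversions in both directions could then be registered with setAs() from the methods package:

```r
library(SingleCellExperiment)

# Hypothetical class from a third-party package; it only holds a count matrix.
setClass("MyCells", representation(counts = "matrix"))

# MyCells -> SingleCellExperiment
setAs("MyCells", "SingleCellExperiment", function(from) {
    SingleCellExperiment(assays = list(counts = from@counts))
})

# SingleCellExperiment -> MyCells
setAs("SingleCellExperiment", "MyCells", function(from) {
    new("MyCells", counts = as.matrix(assay(from, "counts")))
})

# Round trip through both coercions.
x <- new("MyCells", counts = matrix(rpois(20, lambda = 5), nrow = 4))
sce <- as(x, "SingleCellExperiment")
back <- as(sce, "MyCells")
```

A real class would usually carry more than one matrix, in which case each coercion has to decide where the extra information goes (for example, colData() or metadata()).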
For developers, the use of SingleCellExperiment objects within package functions is mostly the same as the use of instances of the base SummarizedExperiment class.
The only exceptions involve direct access to the internal fields of the SingleCellExperiment definition.
Manipulation of these internal fields in other packages is possible but requires some caution, as we shall discuss below.
We use an internal storage mechanism to protect certain fields from direct manipulation by the user.
This ensures that only a call to the provided setter methods can change these fields.
The same effect could be achieved by reserving a subset of columns (or column names) as “private” in colData() and rowData(), though this is not easily implemented.
The internal storage avoids situations where users or functions can silently overwrite these important metadata fields during manipulations of rowData or colData.
This can result in bugs that are difficult to track down, particularly in long workflows involving many functions.
It also allows us to add new methods and metadata types to SingleCellExperiment without worrying about overwriting user-supplied metadata in existing objects.
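The protective effect can be illustrated with a short sketch. It assumes that the reduced dimension results are among the fields kept in the internal storage, and it uses a random matrix in place of a real embedding:

```r
library(SingleCellExperiment)

counts <- matrix(rpois(100, lambda = 10), ncol = 10, nrow = 10)
sce <- SingleCellExperiment(assays = list(counts = counts))

# Store a (random, purely illustrative) 2-dimensional embedding via the setter.
reducedDim(sce, "PCA") <- matrix(rnorm(20), ncol = 2)

# A user or function replaces the column metadata wholesale.
colData(sce) <- DataFrame(batch = rep(c("a", "b"), each = 5))

# The embedding is not part of colData(), so it is unaffected.
reducedDimNames(sce)
dim(reducedDim(sce, "PCA"))
```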
Methods to get or set the internal fields are exported for use by developers of packages that depend on SingleCellExperiment. This allows dependent packages to store their own custom fields that are not meant to be directly accessible by the user. However, this requires some care to avoid conflicts between packages.
The concern is that packages A and B may both define methods that get/set an internal field X in a SingleCellExperiment instance.
Consider the following example object:
library(SingleCellExperiment)
counts <- matrix(rpois(100, lambda = 10), ncol=10, nrow=10)
sce <- SingleCellExperiment(assays = list(counts = counts))
sce
## class: SingleCellExperiment
## dim: 10 10
## metadata(0):
## assays(1): counts
## rownames: NULL
## rowData names(0):
## colnames: NULL
## colData names(0):
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
Assume that we have functions that set an internal field X in packages A and B.
# Function in package A:
AsetX <- function(sce) {
    int_colData(sce)$X <- runif(ncol(sce))
    sce
}
# Function in package B:
BsetX <- function(sce) {
    int_colData(sce)$X <- sample(LETTERS, ncol(sce), replace=TRUE)
    sce
}
If both of these functions are called, one will clobber the output of the other. This may lead to nonsensical results in downstream procedures.
sce2 <- AsetX(sce)
int_colData(sce2)$X
## [1] 0.07887883 0.32435470 0.04705715 0.01332028 0.67232326 0.11619728
## [7] 0.34778133 0.15453983 0.83287347 0.44342647
sce2 <- BsetX(sce2)
int_colData(sce2)$X
## [1] "S" "Q" "A" "P" "X" "O" "J" "J" "P" "M"
We recommend using nested DataFrames to store internal fields in the column-level metadata.
The name of the nested element should be set to the package name, thus avoiding clashes between fields with the same name from different packages.
AsetX_better <- function(sce) {
    int_colData(sce)$A <- DataFrame(X=runif(ncol(sce)))
    sce
}
BsetX_better <- function(sce) {
    choice <- sample(LETTERS, ncol(sce), replace=TRUE)
    int_colData(sce)$B <- DataFrame(X=choice)
    sce
}
sce2 <- AsetX_better(sce)
sce2 <- BsetX_better(sce2)
int_colData(sce2)$A$X
## [1] 0.0004691291 0.4537550765 0.6198088124 0.3739408611 0.5113576634
## [6] 0.2792894603 0.4517558753 0.9402053002 0.5728990654 0.3548065552
int_colData(sce2)$B$X
## [1] "P" "P" "P" "N" "V" "L" "S" "F" "D" "Z"
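A matching getter can follow the same convention. The sketch below is only one possible design: AgetX is a hypothetical function name, and the choice to raise an error when the field is missing (rather than return NULL) is an assumption, not a requirement of the class.

```r
library(SingleCellExperiment)

# Setter for package "A", following the nested DataFrame convention.
AsetX_better <- function(sce) {
    int_colData(sce)$A <- DataFrame(X = runif(ncol(sce)))
    sce
}

# Hypothetical getter: fails with a clear message if the field was never set.
AgetX <- function(sce) {
    fields <- int_colData(sce)
    if (!"A" %in% colnames(fields) || !"X" %in% colnames(fields$A)) {
        stop("field 'X' from package 'A' has not been set")
    }
    fields$A$X
}

sce <- SingleCellExperiment(assays = list(counts = matrix(rpois(100, 10), 10, 10)))
x <- AgetX(AsetX_better(sce))
```

Wrapping access in a getter like this keeps the nesting scheme private to the package, so the layout can change later without affecting callers.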
The same approach can be applied to the row-level metadata, e.g., for some per-row field Y.
AsetY_better <- function(sce) {
    int_elementMetadata(sce)$A <- DataFrame(Y=runif(nrow(sce)))
    sce
}
BsetY_better <- function(sce) {
    choice <- sample(LETTERS, nrow(sce), replace=TRUE)
    int_elementMetadata(sce)$B <- DataFrame(Y=choice)
    sce
}
sce2 <- AsetY_better(sce)
sce2 <- BsetY_better(sce2)
int_elementMetadata(sce2)$A$Y
## [1] 0.7103048 0.9683102 0.9759327 0.6107102 0.2108676 0.3107665 0.1025274
## [8] 0.2885269 0.6958342 0.9450609
int_elementMetadata(sce2)$B$Y
## [1] "X" "M" "O" "T" "Z" "V" "O" "M" "Y" "M"
For the object-wide metadata, a nested list is usually sufficient.
AsetZ_better <- function(sce) {
    int_metadata(sce)$A <- list(Z = "Aaron")
    sce
}
BsetZ_better <- function(sce) {
    int_metadata(sce)$B <- list(Z = "Davide")
    sce
}
sce2 <- AsetZ_better(sce)
sce2 <- BsetZ_better(sce2)
int_metadata(sce2)$A$Z
## [1] "Aaron"
int_metadata(sce2)$B$Z
## [1] "Davide"
In this manner, both A and B can set their internal X, Y and Z without interfering with each other.
Of course, this strategy assumes that packages do not have the same names as some of the in-built internal fields (which would be very unfortunate).
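A package could guard against such a clash by checking its name against the fields already present in an object. The helper below is a hypothetical sketch (clashesWithInternal is not part of SingleCellExperiment); it reads the existing field names from the object rather than hard-coding them, since the set of built-in fields may differ between versions.

```r
library(SingleCellExperiment)

# Hypothetical helper: TRUE if 'pkg' is already used as a field name in any
# of the three internal containers of 'sce'.
clashesWithInternal <- function(sce, pkg) {
    taken <- c(
        colnames(int_colData(sce)),
        colnames(int_elementMetadata(sce)),
        names(int_metadata(sce))
    )
    pkg %in% taken
}

sce <- SingleCellExperiment(assays = list(counts = matrix(rpois(100, 10), 10, 10)))

# A name that is not in use yet.
clashesWithInternal(sce, "A")

# A name taken directly from the existing internal column-level fields.
clashesWithInternal(sce, colnames(int_colData(sce))[1])
```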
If your package accesses the internal fields of the SingleCellExperiment class, we suggest you get in touch with us on GitHub.
This will help us in planning changes to the internal organization of the class.
It will also allow us to contact you with respect to changes or to get feedback.
We are particularly interested in scenarios where multiple packages are defining internal fields with the same scientific meaning. In such cases, it may be valuable to provide getters and setters for this field in SingleCellExperiment directly. This reduces redundancy in the definitions across packages and promotes interoperability. For example, methods from one package can set the field, which can then be used by methods of another package.
Why reducedDims?
We use a SimpleList as the reducedDims slot to allow for multiple dimensionality reduction results.
One can imagine that different dimensionality reduction techniques will be useful for different aspects of the analysis, e.g., t-SNE for visualization, PCA for pseudo-time inference.
We see reducedDims as a similar slot to assays() in that multiple matrices can be stored, though the dimensionality reduction results need not have the same number of dimensions.
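As an illustration, the sketch below stores two results with different numbers of dimensions in the same object. The matrices are random placeholders rather than real PCA or t-SNE output, and the names "PCA" and "TSNE" are arbitrary labels.

```r
library(SingleCellExperiment)

sce <- SingleCellExperiment(assays = list(counts = matrix(rpois(100, 10), 10, 10)))

# Random matrices standing in for real results; one row per cell.
reducedDims(sce) <- list(
    PCA = matrix(rnorm(50), nrow = ncol(sce), ncol = 5),
    TSNE = matrix(rnorm(20), nrow = ncol(sce), ncol = 2)
)

reducedDimNames(sce)

# Number of dimensions retained in each result.
vapply(as.list(reducedDims(sce)), ncol, integer(1))
```

Each result must have one row per cell, but the number of columns is free to vary.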
Why extend RangedSummarizedExperiment?
We decided to extend RangedSummarizedExperiment rather than SummarizedExperiment because for certain assays it will be essential to have rowRanges().
Even for RNA-seq, it is sometimes useful to have rowRanges() and other classes to define the genomic coordinates, e.g., DESeqDataSet in the DESeq2 package.
An alternative would have been to have two classes, SingleCellExperiment and RangedSingleCellExperiment.
However, this seems like unnecessary duplication, as a class whose rowRanges are empty by default is good enough when one does not need rowRanges.
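The following sketch shows both situations on the same object: first the default, where no coordinates are supplied, and then the assignment of genomic intervals. The coordinates are made up for illustration only.

```r
library(SingleCellExperiment)

sce <- SingleCellExperiment(assays = list(counts = matrix(rpois(100, 10), 10, 10)))

# No coordinates were supplied, so rowRanges() holds one empty entry per row.
length(rowRanges(sce))

# Made-up coordinates on a single chromosome, one 50 bp interval per row.
rowRanges(sce) <- GRanges("chr1",
    IRanges(start = seq(1, by = 100, length.out = nrow(sce)), width = 50))
rowRanges(sce)
```

Code that never touches rowRanges() behaves the same in either case, which is why a single class is sufficient.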
Why not use a MultiAssayExperiment?
Another approach to storing alternative Experiments would be to use a MultiAssayExperiment.
We do not do so as the vast majority of scRNA-seq data analyses operate on the endogenous genes.
Switching to a MultiAssayExperiment introduces an additional layer of indirection with no benefit in most cases.
Indeed, the methods of this class are largely unnecessary when the alternative Experiments contain data for the same samples.
By storing nested Experiments, we maintain the familiar SummarizedExperiment interface for better compatibility and ease of use.
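A short sketch of the nested approach, assuming a made-up second feature set labelled "Spikes" that was measured in the same cells as the main experiment:

```r
library(SingleCellExperiment)

# Main experiment: 10 endogenous genes across 10 cells.
sce <- SingleCellExperiment(assays = list(counts = matrix(rpois(100, 10), 10, 10)))

# Alternative experiment: 3 made-up features measured in the same 10 cells.
spikes <- SummarizedExperiment(assays = list(counts = matrix(rpois(30, 5), 3, 10)))
altExp(sce, "Spikes") <- spikes

altExpNames(sce)
dim(sce)
dim(altExp(sce, "Spikes"))

# Subsetting by column carries the nested experiment along.
dim(altExp(sce[, 1:4], "Spikes"))
```

The main object keeps the usual SummarizedExperiment interface, and the nested experiment is reached only when it is explicitly requested.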
sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.5 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.15-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.15-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] SingleCellExperiment_1.18.1 SummarizedExperiment_1.26.1
## [3] Biobase_2.56.0 GenomicRanges_1.48.0
## [5] GenomeInfoDb_1.32.4 IRanges_2.30.1
## [7] S4Vectors_0.34.0 BiocGenerics_0.42.0
## [9] MatrixGenerics_1.8.1 matrixStats_0.62.0
## [11] BiocStyle_2.24.0
##
## loaded via a namespace (and not attached):
## [1] bslib_0.4.0 compiler_4.2.1 BiocManager_1.30.18
## [4] jquerylib_0.1.4 XVector_0.36.0 bitops_1.0-7
## [7] tools_4.2.1 zlibbioc_1.42.0 digest_0.6.29
## [10] lattice_0.20-45 jsonlite_1.8.2 evaluate_0.16
## [13] rlang_1.0.6 Matrix_1.5-1 DelayedArray_0.22.0
## [16] cli_3.4.1 yaml_2.3.5 xfun_0.33
## [19] fastmap_1.1.0 GenomeInfoDbData_1.2.8 stringr_1.4.1
## [22] knitr_1.40 sass_0.4.2 grid_4.2.1
## [25] R6_2.5.1 rmarkdown_2.16 bookdown_0.29
## [28] magrittr_2.0.3 htmltools_0.5.3 stringi_1.7.8
## [31] RCurl_1.98-1.8 cachem_1.0.6