TENxPBMCData 1.6.0
The TENxPBMCData package provides an R / Bioconductor resource for representing and manipulating nine different single-cell RNA-seq (scRNA-seq) data sets on peripheral blood mononuclear cells (PBMC) generated by 10X Genomics.
The number in the dataset title is roughly the number of cells in the experiment.
This package makes extensive use of the HDF5Array package to avoid loading the entire data set into memory, instead storing the counts on disk as an HDF5 file and loading subsets of the data into memory upon request.
Note: The purpose of this package is to provide testing and example data for Bioconductor packages. We have done no processing of the “filtered” 10X scRNA-seq data; it is delivered as is.
We use the TENxPBMCData function to download the relevant files from Bioconductor’s ExperimentHub web resource. This includes the HDF5 file containing the counts, as well as the metadata on the rows (genes) and columns (cells). The output is a single SingleCellExperiment object from the SingleCellExperiment package. This is equivalent to a SummarizedExperiment class but with a number of features specific to single-cell data.
library(TENxPBMCData)
tenx_pbmc4k <- TENxPBMCData(dataset = "pbmc4k")
tenx_pbmc4k
## class: SingleCellExperiment
## dim: 33694 4340
## metadata(0):
## assays(1): counts
## rownames(33694): ENSG00000243485 ENSG00000237613 ... ENSG00000277475
## ENSG00000268674
## rowData names(3): ENSEMBL_ID Symbol_TENx Symbol
## colnames: NULL
## colData names(11): Sample Barcode ... Individual Date_published
## reducedDimNames(0):
## altExpNames(0):
Note: the pbmc68k dataset may be of particular interest to some users because of its size.
The first call to TENxPBMCData() may take some time due to the need to download some moderately large files. The files are then stored locally such that ensuing calls in the same or new sessions are fast. Use the dataset argument to select which dataset to download; values are visible through the function definition:
args(TENxPBMCData)
## function (dataset = c("pbmc4k", "pbmc4k", "pbmc68k", "frozen_pbmc_donor_a",
## "frozen_pbmc_donor_b", "frozen_pbmc_donor_c", "pbmc33k",
## "pbmc3k", "pbmc6k", "pbmc4k", "pbmc8k"))
## NULL
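The valid values can also be recovered programmatically from a function's formals. The sketch below uses a hypothetical stand-in, `demo_loader()`, with an abbreviated choice vector so that it runs without the package or any download. It assumes the argument is resolved with `match.arg()`, as is common for arguments declared this way; the text above does not confirm that for `TENxPBMCData()` itself.

```r
# Hypothetical stand-in with the same style of 'dataset' argument;
# the choice vector is abbreviated and repeats one entry for illustration.
demo_loader <- function(dataset = c("pbmc4k", "pbmc68k", "pbmc3k", "pbmc4k")) {
    # match.arg() returns the first choice when 'dataset' is not supplied
    # and signals an error for values that are not among the choices.
    match.arg(dataset)
}

# Evaluate the default expression to recover the valid values,
# dropping repeated entries.
choices <- unique(eval(formals(demo_loader)$dataset))
choices

default_choice <- demo_loader()
explicit_choice <- demo_loader("pbmc68k")
```

The same `formals()` idiom should apply to any function whose argument default is a character vector of choices.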
The count matrix itself is represented as a DelayedMatrix from the DelayedArray package. This wraps the underlying HDF5 file in a container that can be manipulated in R. Each count represents the number of unique molecular identifiers (UMIs) assigned to a particular gene in a particular cell.
counts(tenx_pbmc4k)
## <33694 x 4340> matrix of class DelayedMatrix and type "integer":
## [,1] [,2] [,3] [,4] ... [,4337] [,4338] [,4339]
## ENSG00000243485 0 0 0 0 . 0 0 0
## ENSG00000237613 0 0 0 0 . 0 0 0
## ENSG00000186092 0 0 0 0 . 0 0 0
## ENSG00000238009 0 0 0 0 . 0 0 0
## ENSG00000239945 0 0 0 0 . 0 0 0
## ... . . . . . . . .
## ENSG00000277856 0 0 0 0 . 0 0 0
## ENSG00000275063 0 0 0 0 . 0 0 0
## ENSG00000271254 0 0 0 0 . 0 0 0
## ENSG00000277475 0 0 0 0 . 0 0 0
## ENSG00000268674 0 0 0 0 . 0 0 0
## [,4340]
## ENSG00000243485 0
## ENSG00000237613 0
## ENSG00000186092 0
## ENSG00000238009 0
## ENSG00000239945 0
## ... .
## ENSG00000277856 0
## ENSG00000275063 0
## ENSG00000271254 0
## ENSG00000277475 0
## ENSG00000268674 0
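To see the delayed behaviour without downloading anything, the following sketch wraps a small simulated matrix in a `DelayedArray`. The in-memory `toy` matrix is an assumption made for illustration, whereas the package data are HDF5-backed. Subsetting returns another delayed object, and `as.matrix()` realizes the selected values as an ordinary matrix.

```r
library(DelayedArray)

set.seed(1)
# Small simulated count matrix standing in for the real data (illustrative only).
toy <- matrix(rpois(60, lambda = 2), nrow = 10, ncol = 6)
delayed <- DelayedArray(toy)

# Subsetting is recorded as a delayed operation on the wrapped data.
sub <- delayed[1:4, 1:3]
is_delayed <- is(sub, "DelayedMatrix")

# as.matrix() realizes the subset as an ordinary in-memory matrix.
realized <- as.matrix(sub)
```

Realizing only the subset that is needed keeps memory use small when the full matrix lives on disk.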
To quickly explore the data set, we compute some summary statistics on the count matrix. We set the DelayedArray block size to indicate that up to 1 GB of memory may be used when loading data from disk.
options(DelayedArray.block.size=1e9)
We are interested in library sizes, colSums(counts(tenx_pbmc4k)); the number of genes expressed per cell, colSums(counts(tenx_pbmc4k) != 0); and average expression across cells, rowMeans(counts(tenx_pbmc4k)). A naive implementation might be
lib.sizes <- colSums(counts(tenx_pbmc4k))
n.exprs <- colSums(counts(tenx_pbmc4k) != 0L)
ave.exprs <- rowMeans(counts(tenx_pbmc4k))
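Summaries like these can be computed on a disk-backed matrix because the work can proceed one block at a time instead of on the whole matrix at once. The base R sketch below illustrates that idea on a small simulated matrix. The helper `blockwise_col_sums()` and its `block_ncol` parameter are hypothetical, and this is not a description of how DelayedArray is implemented internally.

```r
# Hypothetical helper: column sums computed one block of columns at a time,
# so only 'block_ncol' columns need to be held in memory at once.
blockwise_col_sums <- function(x, block_ncol = 2L) {
    out <- numeric(ncol(x))
    starts <- seq(1L, ncol(x), by = block_ncol)
    for (start in starts) {
        idx <- start:min(start + block_ncol - 1L, ncol(x))
        # With a disk-backed matrix, this subset is what would be read from disk.
        block <- x[, idx, drop = FALSE]
        out[idx] <- colSums(block)
    }
    out
}

set.seed(1)
# Simulated counts standing in for the real matrix (illustrative only).
toy_counts <- matrix(rpois(50, lambda = 1), nrow = 10, ncol = 5)

lib_sizes <- blockwise_col_sums(toy_counts)
n_exprs <- blockwise_col_sums(toy_counts != 0L)
```

A larger block means fewer reads but more memory per read, which is the trade-off the block size option controls.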
More advanced analysis procedures are implemented in various Bioconductor packages; see the SingleCell biocViews for more details.
Saving the tenx_pbmc4k object in a standard manner, e.g.,
destination <- tempfile()
saveRDS(tenx_pbmc4k, file = destination)
saves the row-, column-, and meta-data as an R object, and remembers the location and subset of the HDF5 file from which the object is derived. The object can be read into a new R session with readRDS(destination), provided the HDF5 file remains in its original location.
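Because the serialized object holds only a reference to the file on disk, it can be worth confirming that the file still exists before relying on a reloaded object. The base R sketch below mimics the situation with a hypothetical list, `toy_object`, that records a file path in place of the data. It does not inspect a real SingleCellExperiment, and the backing file is a plain text file, not HDF5.

```r
# Stand-in for the on-disk counts file (hypothetical; a text file, not HDF5).
backing_file <- tempfile(fileext = ".txt")
writeLines("counts live here", backing_file)

# Toy object holding metadata plus the path to the backing file.
toy_object <- list(n_cells = 3L, backing_path = backing_file)

destination <- tempfile(fileext = ".rds")
saveRDS(toy_object, file = destination)

# The reloaded object is usable only while the backing file is in place.
reloaded <- readRDS(destination)
usable_before <- file.exists(reloaded$backing_path)

# Removing the backing file leaves the reloaded object pointing at nothing.
unlink(backing_file)
usable_after <- file.exists(reloaded$backing_path)
```

The round trip itself succeeds in both cases; only access to the referenced data fails once the file is gone.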
sessionInfo()
## R version 4.0.0 (2020-04-24)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.11-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.11-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] TENxPBMCData_1.6.0 HDF5Array_1.16.0
## [3] rhdf5_2.32.0 SingleCellExperiment_1.10.1
## [5] SummarizedExperiment_1.18.1 DelayedArray_0.14.0
## [7] matrixStats_0.56.0 Biobase_2.48.0
## [9] GenomicRanges_1.40.0 GenomeInfoDb_1.24.0
## [11] IRanges_2.22.1 S4Vectors_0.26.0
## [13] BiocGenerics_0.34.0 knitr_1.28
## [15] BiocStyle_2.16.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.4.6 lattice_0.20-41
## [3] assertthat_0.2.1 digest_0.6.25
## [5] mime_0.9 BiocFileCache_1.12.0
## [7] R6_2.4.1 RSQLite_2.2.0
## [9] evaluate_0.14 httr_1.4.1
## [11] pillar_1.4.4 zlibbioc_1.34.0
## [13] rlang_0.4.6 curl_4.3
## [15] blob_1.2.1 Matrix_1.2-18
## [17] rmarkdown_2.1 AnnotationHub_2.20.0
## [19] stringr_1.4.0 RCurl_1.98-1.2
## [21] bit_1.1-15.2 shiny_1.4.0.2
## [23] httpuv_1.5.2 compiler_4.0.0
## [25] xfun_0.13 pkgconfig_2.0.3
## [27] htmltools_0.4.0 tidyselect_1.0.0
## [29] interactiveDisplayBase_1.26.0 tibble_3.0.1
## [31] GenomeInfoDbData_1.2.3 bookdown_0.18
## [33] later_1.0.0 crayon_1.3.4
## [35] dplyr_0.8.5 dbplyr_1.4.3
## [37] bitops_1.0-6 rappdirs_0.3.1
## [39] grid_4.0.0 xtable_1.8-4
## [41] lifecycle_0.2.0 DBI_1.1.0
## [43] magrittr_1.5 stringi_1.4.6
## [45] XVector_0.28.0 promises_1.1.0
## [47] ellipsis_0.3.0 vctrs_0.2.4
## [49] Rhdf5lib_1.10.0 tools_4.0.0
## [51] bit64_0.9-7 glue_1.4.0
## [53] BiocVersion_3.11.1 purrr_0.3.4
## [55] fastmap_1.0.1 yaml_2.2.1
## [57] AnnotationDbi_1.50.0 BiocManager_1.30.10
## [59] ExperimentHub_1.14.0 memoise_1.1.0