CuratedAtlasQuery is a query interface that allows the programmatic
exploration and retrieval of the harmonised, curated and reannotated
CELLxGENE single-cell human cell atlas. Data can be retrieved at cell,
sample, or dataset levels based on filtering criteria.
Harmonised data is stored in the ARDC Nectar Research Cloud, and most
CuratedAtlasQuery functions interact with Nectar via web requests, so a
network connection is required for most functionality.
# Note: in real applications you should use the default value of remote_url
metadata <- get_metadata(remote_url = METADATA_URL)
metadata
#> # Source: SQL [?? x 56]
#> # Database: DuckDB v1.1.2 [biocbuild@Linux 6.8.0-48-generic:R 4.5.0/:memory:]
#> cell_ sample_ cell_type cell_type_harmonised confidence_class
#> <chr> <chr> <chr> <chr> <dbl>
#> 1 TTATGCTAGGGTGTTG_12 039c558c… mature N… immune_unclassified 5
#> 2 GCTTGAACATGGTCTA_12 039c558c… mature N… cd8 tem 3
#> 3 GTCATTTGTGCACCAC_12 039c558c… mature N… immune_unclassified 5
#> 4 AAGGAGCGTATTCTCT_12 039c558c… mature N… immune_unclassified 5
#> 5 ATTGGACTCCGTACAA_12 039c558c… mature N… immune_unclassified 5
#> 6 CTCGTCAAGTACGTTC_12 039c558c… mature N… immune_unclassified 5
#> 7 CGTCTACTCTCTAAGG_11 07e64957… mature N… immune_unclassified 5
#> 8 ACGATCAAGGTGGTTG_15 17640030… mature N… immune_unclassified 5
#> 9 TCATACTTCCGCACGA_15 17640030… mature N… immune_unclassified 5
#> 10 GAGACCCTCCCTTCCC_15 17640030… mature N… immune_unclassified 5
#> # ℹ more rows
#> # ℹ 51 more variables: cell_annotation_azimuth_l2 <chr>,
#> # cell_annotation_blueprint_singler <chr>,
#> # cell_annotation_monaco_singler <chr>, sample_id_db <chr>,
#> # `_sample_name` <chr>, assay <chr>, assay_ontology_term_id <chr>,
#> # file_id_db <chr>, cell_type_ontology_term_id <chr>,
#> # development_stage <chr>, development_stage_ontology_term_id <chr>, …
The metadata variable can then be re-used for all subsequent queries.
metadata |>
dplyr::distinct(tissue, file_id)
#> # Source: SQL [?? x 2]
#> # Database: DuckDB v1.1.2 [biocbuild@Linux 6.8.0-48-generic:R 4.5.0/:memory:]
#> tissue file_id
#> <chr> <chr>
#> 1 heart left ventricle 5775c8d8-e37e-40cd-94f4-8e78b05ca331
#> 2 kidney 5ef8b993-4a02-42ee-9202-a595f6e9a758
#> 3 thymus 5c1cc788-2645-45fb-b1d9-2f43d368bba8
#> 4 respiratory airway fe1bbb3e-8c3b-4dfd-ae20-9d288b8a7699
#> 5 blood 79d07078-90fd-43c3-b705-46c9b4d9d8d3
#> 6 mesenteric lymph node 59dfc135-19c1-4380-a9e8-958908273756
#> 7 kidney blood vessel f7e94dbb-8638-4616-aaf9-16e2212c369f
#> 8 kidney blood vessel 8fee7b82-178b-4c04-bf23-04689415690d
#> 9 respiratory airway 6661ab3a-792a-4682-b58c-4afb98b2c016
#> 10 retina 94039710-0387-40e1-9667-dbbac4c469c1
#> # ℹ more rows
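The printed header (`Source: SQL [?? x 2]`) indicates that the metadata is a lazy database table: dplyr verbs are translated to SQL and the rows are only fetched on request. The sketch below illustrates that behaviour offline. It assumes the duckdb, DBI, dplyr and dbplyr packages are installed, and it uses a small toy table with made-up values in place of the real metadata.

```r
library(dplyr)

# Toy stand-in for the remote metadata table (hypothetical values)
con <- DBI::dbConnect(duckdb::duckdb())
toy_metadata <- data.frame(
  cell_   = paste0("cell", 1:6),
  tissue  = c("blood", "blood", "kidney", "kidney", "kidney", "retina"),
  file_id = c("f1", "f1", "f2", "f2", "f3", "f4")
)
DBI::dbWriteTable(con, "metadata", toy_metadata)
lazy_metadata <- tbl(con, "metadata")

# Nothing is computed in R until collect() is called;
# the count is run by the database
cells_per_tissue <- lazy_metadata |>
  count(tissue, name = "n_cells") |>
  arrange(tissue) |>
  collect()

cells_per_tissue

DBI::dbDisconnect(con, shutdown = TRUE)
```

The same pattern (filter or summarise first, `collect()` last) keeps the amount of data pulled into memory small when working with the full table.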
single_cell_counts =
metadata |>
dplyr::filter(
ethnicity == "African" &
stringr::str_like(assay, "%10x%") &
tissue == "lung parenchyma" &
stringr::str_like(cell_type, "%CD4%")
) |>
get_single_cell_experiment()
#> ℹ Realising metadata.
#> ℹ Synchronising files
#> ℹ Downloading 0 files, totalling 0 GB
#> ℹ Reading files.
#> ℹ Compiling Single Cell Experiment.
single_cell_counts
#> class: SingleCellExperiment
#> dim: 36229 1571
#> metadata(0):
#> assays(1): counts
#> rownames(36229): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
#> rowData names(0):
#> colnames(1571): ACACCAAAGCCACCTG_SC18_1 TCAGCTCCAGACAAGC_SC18_1 ...
#> CAGCATAAGCTAACAA_F02607_1 AAGGAGCGTATAATGG_F02607_1
#> colData names(56): sample_ cell_type ... updated_at_y original_cell_id
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
Counts scaled per million are helpful if just a few genes are of interest, as they can be compared across samples.
single_cell_counts =
metadata |>
dplyr::filter(
ethnicity == "African" &
stringr::str_like(assay, "%10x%") &
tissue == "lung parenchyma" &
stringr::str_like(cell_type, "%CD4%")
) |>
get_single_cell_experiment(assays = "cpm")
#> ℹ Realising metadata.
#> ℹ Synchronising files
#> ℹ Downloading 0 files, totalling 0 GB
#> ℹ Reading files.
#> ℹ Compiling Single Cell Experiment.
single_cell_counts
#> class: SingleCellExperiment
#> dim: 36229 1571
#> metadata(0):
#> assays(1): cpm
#> rownames(36229): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
#> rowData names(0):
#> colnames(1571): ACACCAAAGCCACCTG_SC18_1 TCAGCTCCAGACAAGC_SC18_1 ...
#> CAGCATAAGCTAACAA_F02607_1 AAGGAGCGTATAATGG_F02607_1
#> colData names(56): sample_ cell_type ... updated_at_y original_cell_id
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
single_cell_counts =
metadata |>
dplyr::filter(
ethnicity == "African" &
stringr::str_like(assay, "%10x%") &
tissue == "lung parenchyma" &
stringr::str_like(cell_type, "%CD4%")
) |>
get_single_cell_experiment(assays = "cpm", features = "PUM1")
#> ℹ Realising metadata.
#> ℹ Synchronising files
#> ℹ Downloading 0 files, totalling 0 GB
#> ℹ Reading files.
#> ℹ Compiling Single Cell Experiment.
single_cell_counts
#> class: SingleCellExperiment
#> dim: 1 1571
#> metadata(0):
#> assays(1): cpm
#> rownames(1): PUM1
#> rowData names(0):
#> colnames(1571): ACACCAAAGCCACCTG_SC18_1 TCAGCTCCAGACAAGC_SC18_1 ...
#> CAGCATAAGCTAACAA_F02607_1 AAGGAGCGTATAATGG_F02607_1
#> colData names(56): sample_ cell_type ... updated_at_y original_cell_id
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
This converts the H5 SingleCellExperiment to Seurat, so it might take a long time and occupy a lot of memory, depending on how many cells you are requesting.
single_cell_counts_seurat =
metadata |>
dplyr::filter(
ethnicity == "African" &
stringr::str_like(assay, "%10x%") &
tissue == "lung parenchyma" &
stringr::str_like(cell_type, "%CD4%")
) |>
get_seurat()
#> ℹ Realising metadata.
#> ℹ Synchronising files
#> ℹ Downloading 0 files, totalling 0 GB
#> ℹ Reading files.
#> ℹ Compiling Single Cell Experiment.
single_cell_counts_seurat
#> An object of class Seurat
#> 36229 features across 1571 samples within 1 assay
#> Active assay: originalexp (36229 features, 0 variable features)
#> 2 layers present: counts, data
SingleCellExperiment
The returned SingleCellExperiment can be saved with two modalities, as
.rds or as HDF5.
Saving as .rds has the advantage of being fast, and the .rds file
occupies very little disk space as it only stores the links to the files
in your cache.
However, it has the disadvantage that for big SingleCellExperiment
objects, which merge many HDF5 files from your
get_single_cell_experiment call, display and manipulation are going to
be slow. In addition, an .rds saved in this way is not portable: you
will not be able to share it with other users.
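A minimal sketch of the .rds route, assuming the SingleCellExperiment package is installed. The toy object and the file name are hypothetical stand-ins for the result of get_single_cell_experiment. Note that the toy object holds its counts in memory, so its .rds contains the data itself, whereas an HDF5-backed object from the cache would store only links.

```r
library(SingleCellExperiment)

# Toy stand-in for the object returned by get_single_cell_experiment()
toy_counts <- matrix(
  1:20, nrow = 4,
  dimnames = list(paste0("gene", 1:4), paste0("cell", 1:5))
)
toy_sce <- SingleCellExperiment(assays = list(counts = toy_counts))

# Hypothetical output path in a temporary directory
rds_path <- file.path(tempdir(), "single_cell_counts.rds")

# Serialise the object and read it back
saveRDS(toy_sce, rds_path)
restored <- readRDS(rds_path)
restored
```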
Saving as .hdf5 executes any pending computation on the
SingleCellExperiment and writes it to disk as a monolithic HDF5. Once
this is done, operations on the SingleCellExperiment will be
comparatively fast. The resulting .hdf5 file will also be fully portable
and shareable.
However, this .hdf5 has the disadvantage of being larger than the
corresponding .rds, as it includes a copy of the count information, and
the saving process is going to be slow for large objects.
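A minimal sketch of the HDF5 route, assuming the SingleCellExperiment and HDF5Array packages are installed. It uses HDF5Array's saveHDF5SummarizedExperiment and loadHDF5SummarizedExperiment on a toy object; the directory name is hypothetical, and the source does not state which function the authors recommend, so treat this as one possible approach.

```r
library(SingleCellExperiment)
library(HDF5Array)

# Toy stand-in for the object returned by get_single_cell_experiment()
toy_counts <- matrix(
  1:20, nrow = 4,
  dimnames = list(paste0("gene", 1:4), paste0("cell", 1:5))
)
toy_sce <- SingleCellExperiment(assays = list(counts = toy_counts))

# Hypothetical output directory; it will hold the assay data and the object
h5_dir <- file.path(tempdir(), "single_cell_counts")

# Write a self-contained copy of the object, including the counts
saveHDF5SummarizedExperiment(toy_sce, dir = h5_dir, replace = TRUE)

# Reload it later, or on another machine after copying the directory
reloaded <- loadHDF5SummarizedExperiment(h5_dir)
reloaded
```

Because the whole directory is self-contained, it can be copied or shared as a unit, which is the portability advantage described above.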
We can gather all CD14 monocytes and plot the distribution of HLA-A expression across all tissues.
suppressPackageStartupMessages({
library(ggplot2)
})
# Plots with styling
counts <- metadata |>
# Filter and subset
dplyr::filter(cell_type_harmonised == "cd14 mono") |>
dplyr::filter(file_id_db != "c5a05f23f9784a3be3bfa651198a48eb") |>
# Get counts per million for the HLA-A gene
get_single_cell_experiment(assays = "cpm", features = "HLA-A") |>
suppressMessages() |>
# Add feature to table
tidySingleCellExperiment::join_features("HLA-A", shape = "wide") |>
# Convert to a tibble for plotting
tibble::as_tibble()
# Plot by disease
counts |>
dplyr::with_groups(disease, ~ .x |> dplyr::mutate(median_count = median(`HLA.A`, na.rm = TRUE))) |>
# Plot
ggplot(aes(forcats::fct_reorder(disease, median_count,.desc = TRUE), `HLA.A`,color = file_id)) +
geom_jitter(shape=".") +
# Style
guides(color="none") +
scale_y_log10() +
theme_bw() +
theme(axis.text.x = element_text(angle = 60, vjust = 1, hjust = 1)) +
xlab("Disease") +
ggtitle("HLA-A in CD14 monocytes by disease")
#> Warning in scale_y_log10(): log-10 transformation introduced infinite values.
# Plot by tissue
counts |>
dplyr::with_groups(tissue_harmonised, ~ .x |> dplyr::mutate(median_count = median(`HLA.A`, na.rm = TRUE))) |>
# Plot
ggplot(aes(forcats::fct_reorder(tissue_harmonised, median_count,.desc = TRUE), `HLA.A`,color = file_id)) +
geom_jitter(shape=".") +
# Style
guides(color="none") +
scale_y_log10() +
theme_bw() +
theme(axis.text.x = element_text(angle = 60, vjust = 1, hjust = 1)) +
xlab("Tissue") +
ggtitle("HLA-A in CD14 monocytes by tissue") +
theme(legend.position = "none")
#> Warning in scale_y_log10(): log-10 transformation introduced infinite values.
metadata |>
# Filter and subset
dplyr::filter(cell_type_harmonised=="nk") |>
# Get counts per million for the HLA-A gene
get_single_cell_experiment(assays = "cpm", features = "HLA-A") |>
suppressMessages() |>
# Plot
tidySingleCellExperiment::join_features("HLA-A", shape = "wide") |>
ggplot(aes(tissue_harmonised, `HLA.A`, color = file_id)) +
theme_bw() +
theme(
axis.text.x = element_text(angle = 60, vjust = 1, hjust = 1),
legend.position = "none"
) +
geom_jitter(shape=".") +
xlab("Tissue") +
ggtitle("HLA-A in nk cells by tissue")
Various metadata fields are not common between datasets, so it does not
make sense for these to live in the main metadata table. However, we can
obtain them using the get_unharmonised_metadata() function. This
function returns a data frame with one row per dataset, including the
unharmonised column, which contains the unharmonised metadata as a
nested data frame.
harmonised <- metadata |> dplyr::filter(tissue == "kidney blood vessel")
unharmonised <- get_unharmonised_metadata(harmonised)
unharmonised
#> # A tibble: 4 × 2
#> file_id unharmonised
#> <chr> <list>
#> 1 63523aa3-0d04-4fc6-ac59-5cadd3e73a14 <tbl_dck_[,17]>
#> 2 8fee7b82-178b-4c04-bf23-04689415690d <tbl_dck_[,12]>
#> 3 dc9d8cdd-29ee-4c44-830c-6559cb3d0af6 <tbl_dck_[,14]>
#> 4 f7e94dbb-8638-4616-aaf9-16e2212c369f <tbl_dck_[,14]>
Notice that the columns differ between each dataset’s data frame:
dplyr::pull(unharmonised) |> head(2)
#> [[1]]
#> # Source: SQL [?? x 17]
#> # Database: DuckDB v1.1.2 [biocbuild@Linux 6.8.0-48-generic:R 4.5.0/:memory:]
#> cell_ file_id donor_age donor_uuid library_uuid mapped_reference_ann…¹
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 4602STDY701… 63523a… 27 months a8536b6a-… 5ddaeac7-0f… GENCODE 24
#> 2 4602STDY701… 63523a… 27 months a8536b6a-… 5ddaeac7-0f… GENCODE 24
#> 3 4602STDY701… 63523a… 27 months a8536b6a-… 5ddaeac7-0f… GENCODE 24
#> 4 4602STDY701… 63523a… 27 months a8536b6a-… 5ddaeac7-0f… GENCODE 24
#> 5 4602STDY701… 63523a… 27 months a8536b6a-… 5ddaeac7-0f… GENCODE 24
#> 6 4602STDY701… 63523a… 27 months a8536b6a-… 5ddaeac7-0f… GENCODE 24
#> 7 4602STDY701… 63523a… 27 months a8536b6a-… 5ddaeac7-0f… GENCODE 24
#> 8 4602STDY701… 63523a… 27 months a8536b6a-… 5ddaeac7-0f… GENCODE 24
#> 9 4602STDY709… 63523a… 19 months 46318131-… 67178571-39… GENCODE 24
#> 10 4602STDY701… 63523a… 27 months a8536b6a-… 5ddaeac7-0f… GENCODE 24
#> # ℹ more rows
#> # ℹ abbreviated name: ¹mapped_reference_annotation
#> # ℹ 11 more variables: sample_uuid <chr>, suspension_type <chr>,
#> # suspension_uuid <chr>, author_cell_type <chr>, cell_state <chr>,
#> # reported_diseases <chr>, Short_Sample <chr>, Project <chr>,
#> # Experiment <chr>, compartment <chr>, broad_celltype <chr>
#>
#> [[2]]
#> # Source: SQL [?? x 12]
#> # Database: DuckDB v1.1.2 [biocbuild@Linux 6.8.0-48-generic:R 4.5.0/:memory:]
#> cell_ file_id orig.ident nCount_RNA nFeature_RNA seurat_clusters Project
#> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
#> 1 1069 8fee7b82-17… 4602STDY7… 16082 3997 25 Experi…
#> 2 1214 8fee7b82-17… 4602STDY7… 1037 606 25 Experi…
#> 3 2583 8fee7b82-17… 4602STDY7… 3028 1361 25 Experi…
#> 4 2655 8fee7b82-17… 4602STDY7… 1605 859 25 Experi…
#> 5 3609 8fee7b82-17… 4602STDY7… 1144 682 25 Experi…
#> 6 3624 8fee7b82-17… 4602STDY7… 1874 963 25 Experi…
#> 7 3946 8fee7b82-17… 4602STDY7… 1296 755 25 Experi…
#> 8 5163 8fee7b82-17… 4602STDY7… 11417 3255 25 Experi…
#> 9 5446 8fee7b82-17… 4602STDY7… 1769 946 19 Experi…
#> 10 6275 8fee7b82-17… 4602STDY7… 3750 1559 25 Experi…
#> # ℹ more rows
#> # ℹ 5 more variables: donor_id <chr>, compartment <chr>, broad_celltype <chr>,
#> # author_cell_type <chr>, Sample <chr>
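Because each dataset carries its own set of unharmonised columns, it can help to list the column names per dataset and find the ones they share. The sketch below shows this on a toy nested tibble with made-up values, assuming the tibble package is installed. In the real result the nested elements are lazy database tables rather than in-memory tibbles, so the same calls are assumed, not verified, to behave identically there.

```r
library(tibble)

# Toy stand-in for the result of get_unharmonised_metadata()
toy_unharmonised <- tibble(
  file_id = c("file_a", "file_b"),
  unharmonised = list(
    tibble(
      cell_       = c("c1", "c2"),
      donor_age   = c("27 months", "19 months"),
      compartment = c("immune", "immune")
    ),
    tibble(
      cell_       = "c3",
      nCount_RNA  = 16082,
      compartment = "stromal"
    )
  )
)

# Column names of each nested data frame, keyed by dataset
columns_per_dataset <- lapply(toy_unharmonised$unharmonised, colnames)
names(columns_per_dataset) <- toy_unharmonised$file_id

# Columns present in every dataset
shared_columns <- Reduce(intersect, columns_per_dataset)
shared_columns
```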
Dataset-specific columns (definitions available at cellxgene.cziscience.com):
cell_count, collection_id, created_at.x, created_at.y,
dataset_deployments, dataset_id, file_id, filename, filetype,
is_primary_data.y, is_valid, linked_genesets, mean_genes_per_cell, name,
published, published_at, revised_at, revision, s3_uri, schema_version,
tombstone, updated_at.x, updated_at.y, user_submitted, x_normalization
Sample-specific columns (definitions available at cellxgene.cziscience.com):
sample_, sample_name, age_days, assay, assay_ontology_term_id,
development_stage, development_stage_ontology_term_id, ethnicity,
ethnicity_ontology_term_id, experiment___, organism,
organism_ontology_term_id, sample_placeholder, sex,
sex_ontology_term_id, tissue, tissue_harmonised,
tissue_ontology_term_id, disease, disease_ontology_term_id,
is_primary_data.x
Cell-specific columns (definitions available at cellxgene.cziscience.com):
cell_, cell_type, cell_type_ontology_term_id, cell_type_harmonised,
confidence_class, cell_annotation_azimuth_l2,
cell_annotation_blueprint_singler
Through harmonisation and curation we introduced custom columns, not present in the original CELLxGENE metadata:
- tissue_harmonised: a coarser tissue name for better filtering
- age_days: the number of days corresponding to the age
- cell_type_harmonised: the consensus call identity (for immune cells)
  using the original and three novel annotations using Seurat Azimuth
  and SingleR
- confidence_class: an ordinal class of how confident
  cell_type_harmonised is. 1 is complete consensus, 2 is three out of
  four, and so on.
- cell_annotation_azimuth_l2: Azimuth cell annotation
- cell_annotation_blueprint_singler: SingleR cell annotation using the
  Blueprint reference
- cell_annotation_monaco_singler: SingleR cell annotation using the
  Monaco reference
- sample_id_db: Sample subdivision for internal use
- file_id_db: File subdivision for internal use
- sample_: Sample ID
- sample_name: How samples were defined
The raw assay includes RNA abundance in the positive real scale (not
transformed with non-linear functions, e.g. log, sqrt). Originally
CELLxGENE included a mix of scales and transformations specified in the
x_normalization column.
The cpm assay includes counts per million.
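As a worked illustration of counts per million, the sketch below applies the usual definition (each count divided by the cell's total count, times one million) to a toy matrix in base R. The source does not spell out how the package computes its cpm assay, so this is an assumption about the standard formula, not a description of the package internals.

```r
# Toy raw counts: 3 genes (rows) by 2 cells (columns), made-up values
toy_counts <- matrix(
  c(10, 0, 90,
     5, 5, 40),
  nrow = 3,
  dimnames = list(c("geneA", "geneB", "geneC"), c("cell1", "cell2"))
)

# Total counts per cell (library size)
library_sizes <- colSums(toy_counts)

# Divide each column by its library size and scale to one million
toy_cpm <- sweep(toy_counts, 2, library_sizes, "/") * 1e6
toy_cpm
```

Here geneA has 10 of 100 counts in cell1 and 5 of 50 counts in cell2, so its scaled value is the same in both cells even though the raw counts differ. This is why scaled values are easier to compare across samples with different sequencing depth.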
sessionInfo()
#> R Under development (unstable) (2024-10-21 r87258)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] ggplot2_3.5.1 CuratedAtlasQueryR_1.5.0
#>
#> loaded via a namespace (and not attached):
#> [1] RColorBrewer_1.1-3 jsonlite_1.8.9
#> [3] magrittr_2.0.3 spatstat.utils_3.1-1
#> [5] farver_2.1.2 rmarkdown_2.29
#> [7] zlibbioc_1.53.0 vctrs_0.6.5
#> [9] ROCR_1.0-11 spatstat.explore_3.3-3
#> [11] forcats_1.0.0 htmltools_0.5.8.1
#> [13] S4Arrays_1.7.1 ttservice_0.4.1
#> [15] Rhdf5lib_1.29.0 SparseArray_1.7.1
#> [17] rhdf5_2.51.0 sass_0.4.9
#> [19] sctransform_0.4.1 parallelly_1.38.0
#> [21] KernSmooth_2.23-24 bslib_0.8.0
#> [23] htmlwidgets_1.6.4 ica_1.0-3
#> [25] plyr_1.8.9 plotly_4.10.4
#> [27] zoo_1.8-12 cachem_1.1.0
#> [29] igraph_2.1.1 mime_0.12
#> [31] lifecycle_1.0.4 pkgconfig_2.0.3
#> [33] Matrix_1.7-1 R6_2.5.1
#> [35] fastmap_1.2.0 GenomeInfoDbData_1.2.13
#> [37] MatrixGenerics_1.19.0 fitdistrplus_1.2-1
#> [39] future_1.34.0 shiny_1.9.1
#> [41] digest_0.6.37 colorspace_2.1-1
#> [43] patchwork_1.3.0 S4Vectors_0.45.0
#> [45] rprojroot_2.0.4 Seurat_5.1.0
#> [47] tensor_1.5 RSpectra_0.16-2
#> [49] irlba_2.3.5.1 GenomicRanges_1.59.0
#> [51] labeling_0.4.3 progressr_0.15.0
#> [53] fansi_1.0.6 spatstat.sparse_3.1-0
#> [55] httr_1.4.7 polyclip_1.10-7
#> [57] abind_1.4-8 compiler_4.5.0
#> [59] withr_3.0.2 DBI_1.2.3
#> [61] fastDummies_1.7.4 highr_0.11
#> [63] HDF5Array_1.35.1 duckdb_1.1.2
#> [65] MASS_7.3-61 DelayedArray_0.33.1
#> [67] tools_4.5.0 lmtest_0.9-40
#> [69] httpuv_1.6.15 future.apply_1.11.3
#> [71] goftest_1.2-3 glue_1.8.0
#> [73] nlme_3.1-166 rhdf5filters_1.19.0
#> [75] promises_1.3.0 grid_4.5.0
#> [77] Rtsne_0.17 cluster_2.1.6
#> [79] reshape2_1.4.4 generics_0.1.3
#> [81] gtable_0.3.6 spatstat.data_3.1-2
#> [83] tidyr_1.3.1 data.table_1.16.2
#> [85] sp_2.1-4 utf8_1.2.4
#> [87] XVector_0.47.0 BiocGenerics_0.53.1
#> [89] spatstat.geom_3.3-3 RcppAnnoy_0.0.22
#> [91] ggrepel_0.9.6 RANN_2.6.2
#> [93] pillar_1.9.0 stringr_1.5.1
#> [95] spam_2.11-0 RcppHNSW_0.6.0
#> [97] later_1.3.2 splines_4.5.0
#> [99] dplyr_1.1.4 lattice_0.22-6
#> [101] deldir_2.0-4 survival_3.7-0
#> [103] tidyselect_1.2.1 SingleCellExperiment_1.29.0
#> [105] miniUI_0.1.1.1 pbapply_1.7-2
#> [107] knitr_1.48 gridExtra_2.3
#> [109] IRanges_2.41.0 SummarizedExperiment_1.37.0
#> [111] scattermore_1.2 stats4_4.5.0
#> [113] xfun_0.49 Biobase_2.67.0
#> [115] matrixStats_1.4.1 UCSC.utils_1.3.0
#> [117] stringi_1.8.4 lazyeval_0.2.2
#> [119] yaml_2.3.10 evaluate_1.0.1
#> [121] codetools_0.2-20 tibble_3.2.1
#> [123] cli_3.6.3 uwot_0.2.2
#> [125] xtable_1.8-4 reticulate_1.39.0
#> [127] munsell_0.5.1 jquerylib_0.1.4
#> [129] GenomeInfoDb_1.43.0 Rcpp_1.0.13-1
#> [131] spatstat.random_3.3-2 globals_0.16.3
#> [133] dbplyr_2.5.0 png_0.1-8
#> [135] spatstat.univar_3.1-1 parallel_4.5.0
#> [137] ellipsis_0.3.2 blob_1.2.4
#> [139] assertthat_0.2.1 dotCall64_1.2
#> [141] tidySingleCellExperiment_1.17.0 listenv_0.9.1
#> [143] viridisLite_0.4.2 scales_1.3.0
#> [145] ggridges_0.5.6 SeuratObject_5.0.2
#> [147] leiden_0.4.3.1 purrr_1.0.2
#> [149] crayon_1.5.3 rlang_1.1.4
#> [151] cowplot_1.1.3