SeSAMe provides extensive native support for the Illumina Mouse Methylation BeadChip array (referred to as the MM285 or MMB array) and the Horvath Mammal40 array (refered to as the Mammal40 array). The MM285 array contains ~285,000 probes covering over 20 design categories including gene promoters, enhancers, CpGs in synteny to human EPIC array as well as other biology. In SeSAMe, MM285 is used as the product abbreviation to distinguish future platforms including MM320. This documents describe the procedure to process the MM285 and the Mammal40 array.
We first load required library and perform sesame data caching (only needed at new SeSAMe installation).
SeSAMe supports automatic inference of the profiled organism. This is achieved by the inferSpecies
function. Usually, users need not call this function explicitly and only need to specify the code S
as part of the 2nd argument of the openSesame
function. See the Basics Usage vignette for more detail.
The following example downloads an example Mammal40 array IDAT pair and call openSesame
function with species inference (note the S
in the prep=
argument).
tmp = tempdir()
sesameAnno_download("Test/GSM4411982_Grn.idat.gz", dest_dir=tmp)
sesameAnno_download("Test/GSM4411982_Red.idat.gz", dest_dir=tmp)
betas = openSesame(sprintf("%s/Test/GSM4411982", tmp), prep="SHCDPM")
The above code is equivalent to
## equivalent to the above openSesame call
betas = getBetas(matchDesign(pOOBAH(dyeBiasNL(inferInfiniumIChannel(
prefixMaskButC(inferSpecies(readIDATpair(
sprintf("%s/Test/GSM4411982", tmp)))))))))
As can be seen, inferSpecies
takes a SigDF as input and outputs an updated SigDF which contains a species-specific masking and color channel designation. This information is key to proper preprocessing since knowledge of the color channel designation and probe hybridization success is the foundational assumption of many preprocessing algorithms. One may instruct the function to return the inferred species information by using the return.species = TRUE
argument. The following example shows this usage:
## $scientificName
## [1] "xenopus_tropicalis"
##
## $taxonID
## [1] 8364
##
## $commonName
## [1] "Xenopus tropicalis"
##
## $assembly
## [1] "Xenopus_tropicalis_v9.1"
Internally, we used a nonparametric scoring method to infer the most likely species from a pool of 310 candidate species from Ensembl (v101). The return.auc = TRUE
argument allows one to peek into the AUC (Area Under the Curve) score generated in this inference. The higher the score, the more likely the data is from the corresponding species. Knowing the scores can help diagnose misclassifications such as when several candidate species are closely related and hard to discriminate from data.
## showing the candidate species with top scores
head(sort(inferSpecies(sdf, return.auc = TRUE), decreasing=TRUE))
## xenopus_tropicalis leptobrachium_leishanense latimeria_chalumnae
## 0.9648250 0.9160480 0.8323410
## pseudonaja_textilis notechis_scutatus salvator_merianae
## 0.8262425 0.8232040 0.8226535
If the user believes that automatic inference gives wrong (most often still close-related) species, one can force species inference to a target species by using the updateSigDF
function. For example, the following code forces the SigDF
to be treated as a mus_musculus
sample. Note this doesn’t alter signal intensity but only change the probe masking and color channel spec (the view of the data, not the data reading itself).
CRITICAL: Since updateSigDF
function resets the whole mask and col column of SigDF. One should use this function (and inferSpecies
) before other preprocessing functions that sets mask and col.
Like species inference, strain-specific masking and preprocessing is important for mouse array samples. This is achieved by the inferStrain
function. The function is represented by the T
code in openSesame
/prepSesame
. The following example shows how to use inferStrain
in openSesame
. Note the use of T
in the prep code.
tmp = tempdir()
sesameAnno_download("Test/204637490002_R05C01_Grn.idat", dest_dir=tmp)
sesameAnno_download("Test/204637490002_R05C01_Red.idat", dest_dir=tmp)
betas = openSesame(sprintf("%s/Test/204637490002_R05C01", tmp), prep="TQCDPB")
Like inferSpecies
, one need to call the inferStrain
function before calling the standard noob
, dyeBiasNL
, etc (by having T
before QCDPB
when calling openSesame
). Also like inferSpecies()
, inferStrain()
will return a new SigDF
with col and mask updated to reflect the change of strain. Optionally, one can also specify return.strain=TRUE
, return.pval=TRUE
or return.probability=TRUE
to return the inferred strain, the p-value or the probabilities of all 37 strain candidates. Internally, the function converts the beta values to variant allele frequencies. It should be noted that since variant allele frequency is not always measured by the M-allele of the probe. SeSAMe flips the β values for some probes to calculate variant allele frequency. The following example shows what inferStrain
does to a SigDF
:
sdf = sesameDataGet("MM285.1.SigDF") # an example dataset
inferStrain(sdf, return.strain = TRUE) # return strain as a string
## [1] "NOD_ShiLtJ"
sdf_after = inferStrain(sdf) # update mask and col by strain inference
sum(sdf$mask) # before strain inference
## [1] 0
## [1] 14599
Let’s visualize the probabilities of all candidate strains using the return.probabilities
option:
library(ggplot2)
p = inferStrain(sdf, return.probability = TRUE)
df = data.frame(strain=names(p), probs=p)
ggplot(data = df, aes(x = strain, y = probs)) +
geom_bar(stat = "identity", color="gray") +
ggtitle("Strain Probabilities") +
ylab("Probability") + xlab("") +
scale_x_discrete(position = "top") +
theme(axis.text.x = element_text(angle = 90, vjust=0.5, hjust=0),
legend.position = "none")
See also The Supplemental Vignette for heatmap validation of strain inference.
In the MM285 array, we designed multimapping probes to allow querying transposable elements and other biology. We also exposed probes with potentially design flaws. These suboptimally designed probes take a new probe ID prefix (“uk”) in addition to the “cg”/“ch”/“rs” typically seen. By default the repeat and suboptimally designed probes are masked by NA
. These masking is done by the qualityMask
function (or Q
in prep codes). To override masking these probes, one can use the resetMask
function (the 0
code in openSesame
) to remove the default masking. We recommend using it after the preprocessing function that depends on reliable/uniquely-mapped probes, but before detection p-value based masking (e.g. pOOBAH). This way, probes that fail detection can still be flagged (they should be).
## [1] 20554
## [1] 7741
UNDER CONSTRUCTION
There are other inferences one can do on the nonhuman arrays, e.g., sex, age (epigenetic clocks), tissue, copy number alteration etc. These will be elaborated in The Inference Vignette.
## R version 4.2.0 (2022-04-22)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.15-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.15-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] wheatmap_0.2.0 ggplot2_3.3.6
## [3] tidyr_1.2.0 dplyr_1.0.9
## [5] SummarizedExperiment_1.26.1 Biobase_2.56.0
## [7] GenomicRanges_1.48.0 GenomeInfoDb_1.32.2
## [9] IRanges_2.30.0 S4Vectors_0.34.0
## [11] MatrixGenerics_1.8.0 matrixStats_0.62.0
## [13] knitr_1.39 sesame_1.14.2
## [15] sesameData_1.14.0 ExperimentHub_2.4.0
## [17] AnnotationHub_3.4.0 BiocFileCache_2.4.0
## [19] dbplyr_2.1.1 BiocGenerics_0.42.0
##
## loaded via a namespace (and not attached):
## [1] nlme_3.1-157 bitops_1.0-7
## [3] bit64_4.0.5 filelock_1.0.2
## [5] RColorBrewer_1.1-3 httr_1.4.3
## [7] tools_4.2.0 bslib_0.3.1
## [9] utf8_1.2.2 R6_2.5.1
## [11] mgcv_1.8-40 DBI_1.1.2
## [13] colorspace_2.0-3 withr_2.5.0
## [15] tidyselect_1.1.2 preprocessCore_1.58.0
## [17] bit_4.0.4 curl_4.3.2
## [19] compiler_4.2.0 cli_3.3.0
## [21] DelayedArray_0.22.0 labeling_0.4.2
## [23] sass_0.4.1 scales_1.2.0
## [25] randomForest_4.7-1 readr_2.1.2
## [27] proxy_0.4-26 rappdirs_0.3.3
## [29] stringr_1.4.0 digest_0.6.29
## [31] rmarkdown_2.14 XVector_0.36.0
## [33] pkgconfig_2.0.3 htmltools_0.5.2
## [35] highr_0.9 fastmap_1.1.0
## [37] rlang_1.0.2 RSQLite_2.2.14
## [39] shiny_1.7.1 farver_2.1.0
## [41] jquerylib_0.1.4 generics_0.1.2
## [43] jsonlite_1.8.0 BiocParallel_1.30.2
## [45] RCurl_1.98-1.6 magrittr_2.0.3
## [47] GenomeInfoDbData_1.2.8 Matrix_1.4-1
## [49] Rcpp_1.0.8.3 munsell_0.5.0
## [51] fansi_1.0.3 lifecycle_1.0.1
## [53] stringi_1.7.6 yaml_2.3.5
## [55] MASS_7.3-57 zlibbioc_1.42.0
## [57] plyr_1.8.7 grid_4.2.0
## [59] blob_1.2.3 ggrepel_0.9.1
## [61] parallel_4.2.0 promises_1.2.0.1
## [63] crayon_1.5.1 lattice_0.20-45
## [65] splines_4.2.0 Biostrings_2.64.0
## [67] hms_1.1.1 KEGGREST_1.36.0
## [69] pillar_1.7.0 reshape2_1.4.4
## [71] glue_1.6.2 BiocVersion_3.15.2
## [73] evaluate_0.15 BiocManager_1.30.18
## [75] png_0.1-7 vctrs_0.4.1
## [77] tzdb_0.3.0 httpuv_1.6.5
## [79] gtable_0.3.0 purrr_0.3.4
## [81] assertthat_0.2.1 cachem_1.0.6
## [83] xfun_0.31 mime_0.12
## [85] xtable_1.8-4 e1071_1.7-9
## [87] later_1.3.0 class_7.3-20
## [89] tibble_3.1.7 AnnotationDbi_1.58.0
## [91] memoise_2.0.1 ellipsis_0.3.2
## [93] interactiveDisplayBase_1.34.0 BiocStyle_2.24.0