DNA methylation studies have increased in number over the past decade
thanks to the recent advances in next-generation sequencing (NGS) and
microarray technology (MA), providing many data sets at high resolution,
enabling researchers to understand methylation patterns and their
regulatory roles in biological processes and diseases.
Notwithstanding that diverse methods and software have created ample
opportunities for researchers to do quantitative analysis, they make it
difficult for practitioners to choose the one that is suitable and
efficient in analyzing DNA methylation data. Having examined most of
differentially methylation identification tools for bisulfite sequencing
(BS-Seq) data, we observed several drawbacks in the existing analytic
tools. To address these issues we have developed a novel differentially
methylated CpG site identification tool which is based on Hidden Markov
models (HMM) called DMCHMM
. This vignette provides some
guidelines on how to use the package and analyze BS-Seq data.
Following topics will be discussed in this vignette:
Two different classes are defined by extending the
SummarizedExperiment-class
. The BSData-class
is designed to hold BS-Seq data. Similarly, cBSData-method
is defined to create a BSData
object. This class includes
two slots: the methReads
, a matrix with columns
representing samples and rows representing genomic positions (CpG sites)
and elements of matrix representing methylation counts at each position
in each sample; the totalReads
, a matrix with similar
columns and rows except the elements representing total number of
reads.
For reading raw BS-Seq data we adopted The readBismark
function from BiSeq
package. The
readBismark-method
reads samples stored in different files
with six columns of chromosome, start position,
end position, methylation percentage, number of
Cs and number of Ts.
Three data files are included in the DMCHMM
package for
illustration. The data can be imported using following code.
library(DMCHMM)
fn <- list.files(system.file("extdata",package = "DMCHMM"))
fn.f <- list.files(system.file("extdata",package="DMCHMM"), full.names=TRUE)
OBJ <- readBismark(fn.f, fn, mc.cores = 2)
## | | | 0%
## | |======================= | 33%
## | |=============================================== | 67%
## | |======================================================================| 100%
##
## Building BSData object.
cdOBJ <- DataFrame(Cell = factor(c("BC", "TC","Mono"),
labels = c("BC", "TC", "Mono")), row.names = c("BCU1568","BCU173","BCU551"))
colData(OBJ) <- cdOBJ
OBJ
## class: BSData
## dim: 25668 3
## metadata(0):
## assays(2): totalReads methReads
## rownames(25668): 1 2 ... 25667 25668
## rowData names(0):
## colnames(3): BCU1568 BCU173 BCU551
## colData names(1): Cell
The above data set only include one sample for each cell type. We need more samples to be able to compare their methylations and find DMCs. For illustration we generate a sample of BS-Seq data as follows.
nr <- 150; nc <- 8
metht <- matrix(as.integer(runif(nr * nc, 0, 20)), nr)
methc <- matrix(rbinom(n=nr*nc,c(metht),prob = runif(nr*nc)),nr,nc)
r1 <- GRanges(rep("chr1", nr), IRanges(1:nr, width=1), strand="*")
names(r1) <- 1:nr
cd1 <- DataFrame(Group=rep(c("G1","G2"),each=nc/2),row.names=LETTERS[1:nc])
OBJ1 <- cBSData(rowRanges=r1,methReads=methc,totalReads=metht,colData=cd1)
OBJ1
## class: BSData
## dim: 150 8
## metadata(0):
## assays(2): totalReads methReads
## rownames(150): 1 2 ... 149 150
## rowData names(0):
## colnames(8): A B ... G H
## colData names(1): Group
There are two approaches to smoothed the data before testing for
DMCs. Either EM or MCMC can be used to predict methylation levels
utilizing HMM. The methHMEM-method
which is developed to
predict methylation levels. The output is a BSDMCs-class
that can be either used to find DMCs or use MCMC algorithm to re-smooth
the raw data. The process is as follows.
## | | | 0% | |========= | 12% | |================== | 25% | |========================== | 38% | |=================================== | 50% | |============================================ | 62% | |==================================================== | 75% | |============================================================= | 88% | |======================================================================| 100%
## class: BSDMCs
## dim: 150 8
## metadata(3): K Beta Pm
## assays(5): methReads totalReads methLevels methStates methVars
## rownames(150): 1 2 ... 149 150
## rowData names(0):
## colnames(8): A B ... G H
## colData names(1): Group
Although EM algorithm is a fast way to smooth the data but the
results are not as good as the MCMC algorithm. The MCMC algorithm,
however, is slow. In order to increase the speed, we first use
methHMEM-method
to find the HMM order for each sample and
then we use methHMCMC-method
to predict methylation levels.
The procedure is as follows.
## | | | 0% | |========= | 12% | |================== | 25% | |========================== | 38% | |=================================== | 50% | |============================================ | 62% | |==================================================== | 75% | |============================================================= | 88% | |======================================================================| 100%
## class: BSDMCs
## dim: 150 8
## metadata(3): K Beta Pm
## assays(5): methReads totalReads methLevels methStates methVars
## rownames(150): 1 2 ... 149 150
## rowData names(0):
## colnames(8): A B ... G H
## colData names(1): Group
Having smoothed the data using HMM, we run linear between predicted
methylation levels and grouping covariate. In case other covariates
exist, one can use the formula
argument to specify a linear
model. When there is no covariates no action is required. The following
command identifies the DMCs. The results are stored in a
BSDMCs-class
and can be retrieved by calling
metadata
command.
## | | | 0% | | | 1% | |= | 1% | |= | 2% | |== | 3% | |=== | 4% | |=== | 5% | |==== | 5% | |==== | 6% | |===== | 7% | |====== | 8% | |====== | 9% | |======= | 9% | |======= | 10% | |======= | 11% | |======== | 11% | |======== | 12% | |========= | 13% | |========== | 14% | |========== | 15% | |=========== | 15% | |=========== | 16% | |============ | 17% | |============= | 18% | |============= | 19% | |============== | 19% | |============== | 20% | |============== | 21% | |=============== | 21% | |=============== | 22% | |================ | 23% | |================= | 24% | |================= | 25% | |================== | 25% | |================== | 26% | |=================== | 27% | |==================== | 28% | |==================== | 29% | |===================== | 29% | |===================== | 30% | |===================== | 31% | |====================== | 31% | |====================== | 32% | |======================= | 33% | |======================== | 34% | |======================== | 35% | |========================= | 35% | |========================= | 36% | |========================== | 37% | |=========================== | 38% | |=========================== | 39% | |============================ | 39% | |============================ | 40% | |============================ | 41% | |============================= | 41% | |============================= | 42% | |============================== | 43% | |=============================== | 44% | |=============================== | 45% | |================================ | 45% | |================================ | 46% | |================================= | 47% | |================================== | 48% | |================================== | 49% | |=================================== | 49% | |=================================== | 50% | |=================================== | 51% | |==================================== | 51% | |==================================== | 52% | |===================================== | 53% | |====================================== | 54% | |====================================== | 55% | |======================================= | 55% | |======================================= | 56% | |======================================== | 57% | |========================================= | 58% | |========================================= | 59% | |========================================== | 59% | |========================================== | 60% | |========================================== | 61% | |=========================================== | 61% | |=========================================== | 62% | |============================================ | 63% | |============================================= | 64% | |============================================= | 65% | |============================================== | 65% | |============================================== | 66% | |=============================================== | 67% | |================================================ | 68% | |================================================ | 69% | |================================================= | 69% | |================================================= | 70% | |================================================= | 71% | |================================================== | 71% | |================================================== | 72% | |=================================================== | 73% | |==================================================== | 74% | |==================================================== | 75% | |===================================================== | 75% | |===================================================== | 76% | |====================================================== | 77% | |======================================================= | 78% | |======================================================= | 79% | |======================================================== | 79% | |======================================================== | 80% | |======================================================== | 81% | |========================================================= | 81% | |========================================================= | 82% | |========================================================== | 83% | |=========================================================== | 84% | |=========================================================== | 85% | |============================================================ | 85% | |============================================================ | 86% | |============================================================= | 87% | |============================================================== | 88% | |============================================================== | 89% | |=============================================================== | 89% | |=============================================================== | 90% | |=============================================================== | 91% | |================================================================ | 91% | |================================================================ | 92% | |================================================================= | 93% | |================================================================== | 94% | |================================================================== | 95% | |=================================================================== | 95% | |=================================================================== | 96% | |==================================================================== | 97% | |===================================================================== | 98% | |===================================================================== | 99% | |======================================================================| 99% | |======================================================================| 100%
## Warning in fdrtool(x, statistic = "pvalue", plot = FALSE, verbose = FALSE):
## There may be too few input test statistics for reliable FDR calculations!
## DMCs pvalues qvalues DMCsGroupG1vsG2 methDirGroupG1vsG2
## 1 0 0.14681254 1 0 hypo
## 2 0 0.54534532 1 0 hypo
## 3 0 0.00786713 1 0 hypo
## 4 0 0.47436236 1 0 hypo
## 5 0 0.22184830 1 0 hypo
## 6 0 0.54870285 1 0 hypo
## R Under development (unstable) (2024-10-21 r87258)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] DMCHMM_1.29.0 fdrtool_1.2.18
## [3] BiocParallel_1.41.0 SummarizedExperiment_1.37.0
## [5] Biobase_2.67.0 GenomicRanges_1.59.0
## [7] GenomeInfoDb_1.43.0 IRanges_2.41.0
## [9] S4Vectors_0.45.0 BiocGenerics_0.53.0
## [11] MatrixGenerics_1.19.0 matrixStats_1.4.1
##
## loaded via a namespace (and not attached):
## [1] sandwich_3.1-1 sass_0.4.9 bitops_1.0-9
## [4] SparseArray_1.7.0 lattice_0.22-6 digest_0.6.37
## [7] evaluate_1.0.1 grid_4.5.0 calibrate_1.7.7
## [10] mvtnorm_1.3-1 fastmap_1.2.0 jsonlite_1.8.9
## [13] Matrix_1.7-1 restfulr_0.0.15 survival_3.7-0
## [16] multcomp_1.4-26 httr_1.4.7 UCSC.utils_1.3.0
## [19] TH.data_1.1-2 Biostrings_2.75.0 XML_3.99-0.17
## [22] codetools_0.2-20 jquerylib_0.1.4 abind_1.4-8
## [25] cli_3.6.3 rlang_1.1.4 crayon_1.5.3
## [28] XVector_0.47.0 splines_4.5.0 cachem_1.1.0
## [31] DelayedArray_0.33.0 yaml_2.3.10 S4Arrays_1.7.0
## [34] tools_4.5.0 parallel_4.5.0 Rsamtools_2.23.0
## [37] GenomeInfoDbData_1.2.13 curl_5.2.3 R6_2.5.1
## [40] BiocIO_1.17.0 zoo_1.8-12 lifecycle_1.0.4
## [43] rtracklayer_1.67.0 zlibbioc_1.53.0 MASS_7.3-61
## [46] bslib_0.8.0 GenomicAlignments_1.43.0 xfun_0.48
## [49] knitr_1.48 rjson_0.2.23 htmltools_0.5.8.1
## [52] rmarkdown_2.28 compiler_4.5.0 RCurl_1.98-1.16