TRESS 1.0.0
The post-transcriptional epigenetic modification on mRNA is an emerging field to study the gene regulatory mechanism and their association with diseases. Recently developed high-throughput sequencing technology named Methylated RNA Immunoprecipitation Sequencing (MeRIP-seq) enables one to profile mRNA epigenetic modification transcriptome-wide.
In MeRIP-seq, mRNA is first fragmented into approximately 100-nucleotide-long oligonucleotides, and then immunoprecipitated by an anti-m\(^6\)A affinity purified antibody. In addition to the immunoprecipitated (IP) samples, libraries are also prepared for input control fragments to measure the corresponding reference mRNA abundance. This process is an RNA-seq experiment. After sequencing, the reads from both the IP and the input samples are aligned to the reference genome. Due to the enrichment from IP process, transcriptomic regions with m\(^6\)A will have more reads clustered and have peak-like shapes when visualizing the read counts along the genome. Therefore, people often refer the m\(^6\)A regions as “peaks”, which is a term usually used in ChIP-seq to represent the protein binding sites. Figure 1 shows some example peaks on the Fat2 gene from a dataset to study m\(^6\)A dynamics during mouse brain development, where m\(^6\)A in cerebellums from 2-week old mice are profiled with two biological replicates.
Two major tasks in the analysis of MeRIP-seq data are to identify transcriptome-wide m\(^6\)A methylation (namely “peak calling”), and differential m6A methylation.
For the first problem, TRESS builds a two-step procedure to identify transcriptome wide m\(^6\)A regions. In the first step, it quickly scans the whole transcriptome and losely identify candidate regions using an ad hoc algorithm. In the second step, it models read counts from candidate regions using an empirical hierarchical negative binomial model to accounts for all-sources variations. It also imposes a prior on the dispersion of methylation, which induces a shrinkage variance estimate by borrowing information from all candidate regions. Wald test is constructed to detect significant m\(^6\)A regions from the candidates. Metthod for the second problem is under development and will be available soon.
From GitHub:
install.packages("devtools") # if you have not installed "devtools" package
library(devtools)
install_github("https://github.com/ZhenxingGuo0015/TRESS",
build_vignettes = TRUE)
To view the package vignette in HTML format, run the following lines in R
library(TRESS)
vignette("TRESS")
TRESS requires paired
input control and IP BAM files for each replicate of all samples:
“input1.bam & ip1.bam”, “input2.bam & ip2.bam”, …. in order to call peaks from all replicates.
The BAM files contain mapped reads sequenced from
respective samples and are the output of sequence alignment tools
like Bowtie2
.
For illustration purpose, we include four example BAM files and one corresponding genome annotation file in our publicly available data package datasetTRES
on github,
which can be installed with
install_github("https://github.com/ZhenxingGuo0015/datasetTRES")
The BAM files contain sequencing reads (only on chromosome 19) from two input & IP mouse brain cerebellum samples.
In addition to BAM files, TRESS also needs a path to
an annotation file in order to obtain transcriptome-wide bins,
bin-level read counts and annotation of each peak.
The annotation file is actually is a TXDB and is saved in
format of *.sqlite
, which is easily created using R function
makeTxDbFromUCSC()
from Bioconductor package GenomicFeatures
:
## Directly use "makeTxDbFromUCSC" function to create one
library(GenomicFeatures)
txdb = makeTxDbFromUCSC("mm9", "knownGene")
# saveDb(txdb, file = paste0("YourPATH", "/", "YourGenome.sqlite")
## or load a TxDb annotation package like
# library(TxDb.Mmusculus.UCSC.mm9.knownGene)
# txdb <- TxDb.Mmusculus.UCSC.mm9.knownGene
# saveDb(txdb, file = paste0("YourPATH", "/", "mm9_knownGene.sqlite")
If one already has one .gtf file, then a TXDB file can be created using the following code:
library(GenomicFeatures)
txdb = makeTxDbFromGFF("directory/to/your/xxx.gtf", format = "gtf")
# saveDb(txdb, file = paste0("YourPATH", "/", "YourGenome.sqlite")
TRESS provides one example MeRIP-seq dataset.
The raw sequencing reads were downloaded
from GEO, and mapped using Bowtie2
to generate
corresponding BAM files.
With BAM files, we apply TRESS to obtain transcriptome
bins and candidate regions. A small subset of both bin-level
and region-level data are stored as examples in
TRESS for package illustration purpose.
The example dataset Basal
comes from 7 basal
mouse brain samples (GEO accession number GSE113781).
The total number of bins and candidate regions for
original data are 1,620,977 and 13,017 respectively.
Due to space limit,
only data of randomly selected 1000 bins and
500 regions (bins and regions are not necessarily overlapped)
are saved. In particular,
for each bin and candidate region,
both the genomic coordinate and read counts
(across 7 paired input & ip replicates) are kept.
library(TRESS)
data("Basal")
Bins <- Basal$Bins$Bins
BinCounts <- Basal$Bins$Counts
dim(BinCounts)
## [1] 1000 14
Candidates <- Basal$Candidates$Regions
CandidatesCounts <- Basal$Candidates$Counts
sf <- Basal$Bins$sf
Check the coordinates of each bin and candidate region.
head(Bins, 3)
## chr start end width strand
## 1 chr1 3195951 3196000 50 -
## 2 chr1 3196001 3196050 50 -
## 3 chr1 3196051 3196100 50 -
head(Candidates, 3)
## chr start end strand summit
## 1 chr9 90099501 90099700 - 90099625
## 2 chr5 145620001 145620200 + 145620075
## 3 chr2 130408151 130408300 - 130408275
Check the read counts in each bin and candidate region.
dim(BinCounts)
## [1] 1000 14
head(BinCounts, 3)
## basal_Input_rep1 basal_IP_rep1 basal_Input_rep2 basal_IP_rep2
## [1,] 2 7 3 0
## [2,] 4 6 8 0
## [3,] 44 8 26 2
## basal_Input_rep3 basal_IP_rep3 basal_Input_rep4 basal_IP_rep4
## [1,] 4 2 2 2
## [2,] 10 2 4 2
## [3,] 22 2 26 12
## basal_Input_rep5 basal_IP_rep5 basal_Input_rep6 basal_IP_rep6
## [1,] 6 0 4 1
## [2,] 10 0 6 2
## [3,] 14 2 26 4
## basal_Input_rep7 basal_IP_rep7
## [1,] 1 0
## [2,] 6 2
## [3,] 13 6
dim(CandidatesCounts)
## [1] 500 14
head(CandidatesCounts,3)
## basal_Input_rep1 basal_IP_rep1 basal_Input_rep2 basal_IP_rep2
## 1 95 2206 92 1430
## 2 750 20841 797 12699
## 3 58 1857 78 1320
## basal_Input_rep3 basal_IP_rep3 basal_Input_rep4 basal_IP_rep4
## 1 150 1810 164 2380
## 2 1013 17832 956 23479
## 3 112 1449 161 2452
## basal_Input_rep5 basal_IP_rep5 basal_Input_rep6 basal_IP_rep6
## 1 122 1415 135 1559
## 2 963 15379 859 15951
## 3 111 1762 118 1625
## basal_Input_rep7 basal_IP_rep7
## 1 80 988
## 2 831 12311
## 3 111 1357
Check the library size factor of each sample in this data.
sf
## basal_Input_rep1 basal_IP_rep1 basal_Input_rep2 basal_IP_rep2
## 0.6807884 1.7213164 0.7472330 1.1091955
## basal_Input_rep3 basal_IP_rep3 basal_Input_rep4 basal_IP_rep4
## 0.7403452 1.3782818 1.0150407 2.2242626
## basal_Input_rep5 basal_IP_rep5 basal_Input_rep6 basal_IP_rep6
## 0.7245767 1.2234676 0.6533518 1.2120361
## basal_Input_rep7 basal_IP_rep7
## 0.6148696 1.1139263
Peak calling in TRESS is performed using one wrapper function
TRESS_peak()
. Here we use example BAM files from datasetTRES
as an example.
## Directly take BAM files in "datasetTRES" available on github
library(datasetTRES)
Input.file = c("cb_input_rep1_chr19.bam", "cb_input_rep2_chr19.bam")
IP.file = c("cb_ip_rep1_chr19.bam", "cb_ip_rep2_chr19.bam")
BamDir = file.path(system.file(package = "datasetTRES"), "extdata/")
annoDir = file.path(system.file(package = "datasetTRES"),
"extdata/mm9_chr19_knownGene.sqlite")
OutDir = "/directory/to/output"
TRESS_peak(IP.file = IP.file,
Input.file = Input.file,
Path_To_AnnoSqlite = annoDir,
InputDir = BamDir,
OutputDir = OutDir, # specify a directory for output
experiment_name = "examplebyBam", # name your output
filetype = "bam")
### example peaks
peaks = read.table(file.path(system.file(package = "TRESS"),
"extdata/examplebyBam_peaks.xls"),
sep = "\t", header = TRUE)
head(peaks[, -c(5, 14, 15)], 3)
## chr start end strand lg.fc cb_input_rep1_chr19.bam
## 1 chr19 23239316 23240265 + 1.357281 529
## 2 chr19 6531415 6531714 + 1.016896 693
## 3 chr19 12844318 12844767 - 1.010331 268
## cb_ip_rep1_chr19.bam cb_input_rep2_chr19.bam cb_ip_rep2_chr19.bam mu
## 1 6092 380 2479 0.8630350
## 2 8220 571 2523 0.8466991
## 3 3848 200 1153 0.8753663
## mu.var stats pvals p.adj rScore
## 1 0.001154263 13.203318 0 0 2.850148
## 2 0.001094432 13.065621 0 0 2.433220
## 3 0.002193997 9.839997 0 0 2.330556
To replace the example BAM files with your BAM files, the codes are:
## or, take BAM files from your path
Input.file = c("input_rep1.bam", "input_rep2.bam")
IP.file = c("ip_rep1.bam", "ip_rep2.bam")
BamDir = "/directory/to/BAMfile"
annoDir = "/path/to/xxx.sqlite"
OutDir = "/directory/to/output"
TRESS_peak(IP.file = IP.file,
Input.file = Input.file,
Path_To_AnnoSqlite = annoDir,
InputDir = BamDir,
OutputDir = OutDir,
experiment_name = "xxx",
filetype = "bam")
peaks = read.table(paste0(OutDir, "/",
"xxx_peaks.xls"),
sep = "\t", header = TRUE)
head(peaks, 3)
As a wrapper function, TRESS_peak()
combines multiple subfunctions to
detect m6A regions transcriptome-wide.
The whole process involves following steps:
\(\quad 1\). Divide genome into equal sized bins and obtain read counts in each replicate using
DivideBins()
.
\(\quad 2\). Call candidate regions based on bin-level data across all replicates using
CallCandidates()
\(\quad 3\). Fit Negative binomial models for read counts in candidate regions to
detect and rank peaks among all candidates using CallPeaks.multiRep()
.
If there is only one replicate, functions in step 2
and step 3 will be replaced by
one function CallPeaks.oneRep()
, while function DivideBins()
will still be used to obtain bin-level data. Please see the man pages for
detailed usage of DivideBins()
. Given bin-level data, peak calling
by sub-functions are included in following
Section 4.1 and Section 4.2.
With bin-level read counts, binomial test is first conducted
for each bin. Then an ad hoc bump-finding algorithm
is applied to merge significant bins
(and/or bins with large fold change) and form bumps in each
replicate.
Bumps from all input & IP replicates are unioned together to
construct a list of candidate regions.
Both binomial tests and bump-finding are done by
function CallCandidates()
in TRESS.
Here, we use example dataset Basal
introduced in Section 3 to run CallCandidates()
.
## load in first example dataset
data("Basal")
Candidates = CallCandidates(
Counts = Basal$Bins$Counts,
bins = Basal$Bins$Bins
)
## Merge bumps from different replicates...
After obtaining candidate regions, TRESS models read counts
in candidate regions using a hierarchical negative-binomial model.
Wald tests are then conducted to detect significant regions
from candidates, where a region with methylation level significantly higher than the background is considered as sigificant.
The background methylation level is estimated based on
total read counts
from non-candidate regions, which is calculated as the total
bin-counts subtracted by counts from candidate regions.
This can be obtained using function BgMethylevel()
in TRESS.
With background methylation level estimated,
complete parameter estimation and statistical inference
for candidate regions are achieved by function
CallPeaks.multiRep()
in TRESS, where argument Candidates
is a list containing at least two components: genomic coordinates
(e.g., “chr”, “start” and “end” etc) of all candidate regions
and their read counts across all samples.
Here is an example, provided that the background
methylation level is 0.5,
data("Basal") ### load candidate regions
Basal$Candidates$sf = Basal$Bins$sf
peaks1 = CallPeaks.multiRep(
Candidates = Basal$Candidates,
mu.cutoff = 0.5
)
## The number of Peaks in this sample is:
## 500
head(peaks1, 3)
## chr start end strand summit lg.fc basal_Input_rep1
## 465 chr3 32361601 32361900 + 32361775 1.714305 539
## 443 chr1 154366001 154366350 - 154366125 1.822726 1143
## 325 chr2 138109901 138110250 + 138110175 2.024479 1781
## basal_IP_rep1 basal_Input_rep2 basal_IP_rep2 basal_Input_rep3 basal_IP_rep3
## 465 16355 561 9555 620 12208
## 443 40907 1448 24974 1643 30279
## 325 55462 2442 33444 2087 42729
## basal_Input_rep4 basal_IP_rep4 basal_Input_rep5 basal_IP_rep5
## 465 656 24306 542 11419
## 443 1660 53566 1823 30305
## 325 2644 72436 2219 39531
## basal_Input_rep6 basal_IP_rep6 basal_Input_rep7 basal_IP_rep7 mu mu.var
## 465 574 10282 535 11448 0.923 0.00014
## 443 1698 30126 1473 26076 0.917 0.00014
## 325 1996 40298 2057 33851 0.914 0.00014
## stats shrkPhi shrkTheta pvals p.adj rScore
## 465 35.368 0.001008 9.974 0 0 6.907
## 443 34.990 0.001008 9.974 0 0 6.439
## 325 34.826 0.001008 9.974 0 0 6.235
A word about CallPeaks.multiRep()
Different criteria are available in TRESS to filter peaks based on results of Wald tests, including p-values, FDR and
log fold change (logFC).
One needs to first specify a criterion through argument WhichThreshold
and
then provide a cutoff for that criterion
through one or two arguments named *.cutoff
.
By default, TRESS adopts both FDR and
logFC (throughWhichThreshold = "fdr_lfc"
) to select peaks.
The default cutoffs for FDR and logFC are 0.05 and 0.7
(for fold change of 2) respectively, meaning that peaks
with FDR < 0.05 and logFC > 0.7 will be kept.
With WhichThreshold = "fdr_lfc"
, one can
change the cutoffs to more stringent values,
e.g., FDR < 0.01 and logFC > 1.6 by setting
fdr.cutoff = 0.01
and lfc.cutoff = 1.6
:
### use different threshold to filter peaks
peaks2 = CallPeaks.multiRep(
Candidates = Basal$Candidates,
mu.cutoff = 0.5,
fdr.cutoff = 0.01,
lfc.cutoff = 1.6
)
## The number of Peaks in this sample is:
## 402
However, one can also set WhichThreshold = "fdr"
to select
peaks only with FDR, then by default peaks
with FDR < 0.05 will be kept:
### use different threshold to filter peaks
peaks3 = CallPeaks.multiRep(
Candidates = Basal$Candidates,
mu.cutoff = 0.5,
WhichThreshold = "fdr"
)
## The number of Peaks in this sample is:
## 500
Please read the manual page of function CallPeaks.multiRep()
for more details on each argument.
Given the usage of function “CallPeaks.multiRep()”,
it can also be adopted to re-rank existing peaks with our
developed methods, if their read counts are available.
This may perform bad if one didn’t properly
estimate library size factor (one argument of Candidates
)
for each sample.
Based on our experience, the estimation of size factor
should be based on the bin-level counts across the
whole transcriptome, not the region-level counts.
For background methylation level,
one can use 0.5 but it would be informative if
estimating it from the data.
The above two-step approach is for data with multiple replicates.
For data with only one replicate, TRESS_peak()
calls function CallPeaks.oneRep()
for peak calling given bin-level data. In this case,
bumps in the only one replicate are taken as final list of peaks.
The statistical significance of each peak comes from binomial tests. Here is an example of how to run function CallPeaks.oneRep()
.
# A toy example
data("Basal")
bincounts = Basal$Bins$Counts[, 1:2]
sf0 = Basal$Bins$sf[1:2]
bins = Basal$Bins$Bins
peaks = CallPeaks.oneRep(Counts = bincounts,
sf = sf0, bins = bins)
head(peaks, 3)
## chr start end strand summit basal_Input_rep1 basal_IP_rep1 pvals
## 8 chr1 4482151 4482500 - 4482375 228 5137 0
## 10 chr1 6204751 6204900 + 6204875 157 1544 0
## 7 chr1 3661001 3661400 - 3661325 537 6331 0
## p.adj lg.fc
## 8 0 0.6047136
## 10 0 0.5631771
## 7 0 0.5162146
With pre-called peaks in hand, one can visualize them using function “ShowOnePeak()” in TRESS. The usage of this function is
ShowOnePeak(onePeak, allBins, binCounts, ext = 500, ylim = c(0,1))
In order to run this function, one needs to have: 1) “onePeak”: a pre-called peak saved as a dataframe, which contains genomic positions for that peak: “chr”, “start”, “end”, “strand”; 2) “allBins”: genomic positions (“chr”, “start”, “end”, “strand”) of bins overlapping at least with the pre-called peak; 3) “binCounts”: the corresponding bin-level read counts in each replicate. This function will plot for each replicate: the methylation level of bins (blue bars) within the target peak (shade region in pink), and the normalized sequencing depth for input samples (curves in grey). We show some example plots here:
peaks = read.table(file.path(system.file(package = "TRESS"),
"extdata/examplebyBam_peaks.xls"),
sep = "\t", header = TRUE)
load(file.path(system.file(package = "TRESS"),
"extdata/examplebyBam.rda"))
allBins = as.data.frame(bins$bins)
colnames(allBins)[1] = "chr"
allBins$strand = binStrand
head(peaks, 1)
## chr start end strand summit lg.fc cb_input_rep1_chr19.bam
## 1 chr19 23239316 23240265 + 23240090 1.357281 529
## cb_ip_rep1_chr19.bam cb_input_rep2_chr19.bam cb_ip_rep2_chr19.bam mu
## 1 6092 380 2479 0.863035
## mu.var stats shrkPhi shrkTheta pvals p.adj rScore
## 1 0.001154263 13.20332 0.002447933 9.974182 0 0 2.850148
ShowOnePeak(onePeak = peaks[1,], allBins = allBins,
binCounts = allCounts)
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] TRESS_1.0.0 S4Vectors_0.32.0 BiocGenerics_0.40.0
## [4] BiocStyle_2.22.0
##
## loaded via a namespace (and not attached):
## [1] MatrixGenerics_1.6.0 Biobase_2.54.0
## [3] httr_1.4.2 sass_0.4.0
## [5] bit64_4.0.5 jsonlite_1.7.2
## [7] bslib_0.3.1 assertthat_0.2.1
## [9] BiocManager_1.30.16 highr_0.9
## [11] BiocFileCache_2.2.0 blob_1.2.2
## [13] GenomeInfoDbData_1.2.7 Rsamtools_2.10.0
## [15] yaml_2.2.1 progress_1.2.2
## [17] lattice_0.20-45 pillar_1.6.4
## [19] RSQLite_2.2.8 glue_1.4.2
## [21] digest_0.6.28 GenomicRanges_1.46.0
## [23] XVector_0.34.0 Matrix_1.3-4
## [25] htmltools_0.5.2 XML_3.99-0.8
## [27] pkgconfig_2.0.3 biomaRt_2.50.0
## [29] magick_2.7.3 bookdown_0.24
## [31] zlibbioc_1.40.0 purrr_0.3.4
## [33] BiocParallel_1.28.0 tibble_3.1.5
## [35] KEGGREST_1.34.0 generics_0.1.1
## [37] IRanges_2.28.0 ellipsis_0.3.2
## [39] SummarizedExperiment_1.24.0 cachem_1.0.6
## [41] GenomicFeatures_1.46.0 magrittr_2.0.1
## [43] crayon_1.4.1 memoise_2.0.0
## [45] evaluate_0.14 fansi_0.5.0
## [47] xml2_1.3.2 tools_4.1.1
## [49] prettyunits_1.1.1 hms_1.1.1
## [51] matrixStats_0.61.0 BiocIO_1.4.0
## [53] lifecycle_1.0.1 stringr_1.4.0
## [55] DelayedArray_0.20.0 AnnotationDbi_1.56.0
## [57] Biostrings_2.62.0 compiler_4.1.1
## [59] jquerylib_0.1.4 GenomeInfoDb_1.30.0
## [61] rlang_0.4.12 grid_4.1.1
## [63] RCurl_1.98-1.5 rjson_0.2.20
## [65] rappdirs_0.3.3 bitops_1.0-7
## [67] rmarkdown_2.11 restfulr_0.0.13
## [69] DBI_1.1.1 curl_4.3.2
## [71] R6_2.5.1 GenomicAlignments_1.30.0
## [73] knitr_1.36 dplyr_1.0.7
## [75] rtracklayer_1.54.0 fastmap_1.1.0
## [77] bit_4.0.4 utf8_1.2.2
## [79] filelock_1.0.2 stringi_1.7.5
## [81] Rcpp_1.0.7 vctrs_0.3.8
## [83] png_0.1-7 dbplyr_2.1.1
## [85] tidyselect_1.1.1 xfun_0.27