Contents
Introduction
The InTAD analysis is focused on the processing of generated object that
combines all input datasets. Required input data is the following:
- epigenetic signals data.frame i.e. enhancers along with their coordinates in
GRanges format
- gene expression counts data.frame along with gene coordinates
in GRanges format
- TAD borders GRanges i.e. result of HiC technique application
Further explained example of performing the analysis procedure is based on
H3K27ac data reflecting activity of enhancers in medulloblastoma brain tumour
descrbied in the manuscript from C.Y.Lin, S.Erkek et al., Nature,
2016.
This dataset includes normalized enhancer signals obtained from H3K27ac
ChIP-seq data and RNA-seq gene expression RPKM counts from 25 medulloblastoma
samples. The test subset is extracted from a selected region inside
chromosome 15. Additionally, the coordinates for enhancers and genes along
with specific sample annotation are provided.
The analysis starts from preparing and loading the data. Here is the
overview of integrated input test data, that can serve as a useful example
describing supported input formats:
library(InTAD)
# normalized enhancer signals table
enhSel[1:3,1:3]
## MB176 MB95 MB40
## chr15:26003055-26004347 0.1936792 0.8005544 -0.4829619
## chr15:26007354-26009802 0.6658638 1.3044401 0.7630649
## chr15:26012427-26015263 -1.3077170 1.0948566 -0.6673826
# enhancer signal genomic coordinates
as.data.frame(enhSelGR[1:3])
## seqnames start end width strand
## 1 chr15 26003055 26004347 1293 *
## 2 chr15 26007354 26009802 2449 *
## 3 chr15 26012427 26015263 2837 *
# gene expression normalized counts
rpkmCountsSel[1:3,1:3]
## MB176 MB95 MB40
## ENSG00000215567.4 0 0.000000 0
## ENSG00000201241.1 0 0.000000 0
## ENSG00000258463.1 0 4.183154 0
# gene coordiantes
as.data.frame(txsSel[1:3])
## seqnames start end width strand gene_id gene_name
## 1 chr15 20083769 20093074 9306 + ENSG00000215567.4 RP11-79C23.1
## 2 chr15 20088867 20088969 103 + ENSG00000201241.1 RNU6-978P
## 3 chr15 20104587 20104812 226 + ENSG00000258463.1 RP11-173D3.3
## gene_type
## 1 pseudogene
## 2 snRNA
## 3 pseudogene
# additional sample info data.frame
head(mbAnnData)
## Subgroup Age Gender Histology M.Stage
## MB176 WNT 9 F Classic M0
## MB95 Group3 3 M Classic M0
## MB40 Group4 3 M Classic M0
## MB37 SHH 1 F Desmoplastic M0
## MB38 Group4 6 M Desmoplastic M0
## MB28 SHH 1 M Desmoplastic M0
Importantly, there are specific requriements for the input datasets. The names
of samples should match in signals and gene expression datasets.
summary(colnames(rpkmCountsSel) == colnames(enhSel))
## Mode TRUE
## logical 25
Next, the genomic regions should be provided for each signal as well as for
each gene.
# compare number of signal regions and in the input table
length(enhSelGR) == nrow(enhSel)
## [1] TRUE
The genomic regions reflecting the gene coordinates must include “gene_id”
and “gene_name” marks. These are typical GTF format markers. One more mark
“gene_type” is also useful to perform filtering of gene expression matrix.
All the requirements are checked during the generation of the InTADSig
object. Main part of this object is MultiAssayExperiment subset
that combines signals and gene expression. Specific annotation information
about samples can be also included for further control and visualization. In
provided example for medulloblastoma samples annotation contains various
aspects such as tumour subgroup, age, gender, etc.
inTadSig <- newSigInTAD(enhSel, enhSelGR, rpkmCountsSel, txsSel,mbAnnData)
## Created signals and genes object for 25 samples
The created object contains MultiAssayExperiment that includes both signals and
gene expression data.
inTadSig
## S4 InTADSig object
## Num samples: 25
## Num signals: 65
## Num genes: 2080
During the main object generation there are also available special options to
activate parallel computing based on usage of R multi-thread librares
and log2 adjustment for gene expression. The generated data subsets can be
accessed using specific call functions on the object i.e. signals or exprs.
Notably, the main object can be also loaded from the text files representing
the input data using function loadSigInTAD. Refer to the documetation of this
function for more details.
Main data analysis
General first from gene expression counts matrix filtering with non- or low expressed genes. However if this counts were not filtered before starting the InTAD analysis it’s possible to adjust gene expression limits using function filterGeneExpr. This function provides parameters to control
minimum gene expression and type. There is additionally a special option to
compute gene expression distribution based on usage of mclust
package in order to find suitable minimum gene expression cut limit.
Here’s example how to activate this:
# filter gene expression
inTadSig <- filterGeneExpr(inTadSig, checkExprDistr = TRUE)
## Initial result: 2080 genes
## Gene expression cut value: 1.79413740491486
## Filtered result: 671 genes
The analysis starts from the combination of signals and genes inside the TADs.
Since the TADs are known to be stable across various cell types, it’s possible
to use already known TADs obtained from IMR90 cells using HiC technology
(Dixon et al 2012). The human
IMR90 TADs regions object is integrated into the package.
# IMR90 hg19 TADs
head(tadGR)
## GRanges object with 6 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## chr1:770137-1250137 chr1 770137-1250137 *
## chr1:1250137-1850140 chr1 1250137-1850140 *
## chr1:1850140-2330140 chr1 1850140-2330140 *
## chr1:2330140-3650140 chr1 2330140-3650140 *
## chr1:4660140-6077413 chr1 4660140-6077413 *
## chr1:6077413-6277413 chr1 6077413-6277413 *
## -------
## seqinfo: 23 sequences from an unspecified genome; no seqlengths
However, since the variance is actually observed between TAD calling methods
(i.e. described in detailed review by Rola Dali and Mathieu Blanchette, NAR
2017 ), novel obtained
TADs can be also applied for the analysis. The requried format: GRanges object.
Composition of genes and signals in TADs is performed using function
combineInTAD that has several options. By default, it marks the signal
belonging to the TAD by largest overlap and also takes into account genes that
are not overlaping the TADs by connecting them to the closest TAD. This can be
sensetive strategy since some genomic regions can be missed due to the limits
of input HiC data and variance of existing TAD calling methods.
# combine signals and genes in TADs
inTadSig <- combineInTAD(inTadSig, tadGR)
## Combined 520 signal-gene pairs in TADs
Final step is the correlation analysis. Various options are avialble for this
function i.e. correlation method, computation of q-value to control the
evidence strength and visualization of connection proportions. This last option
allows to show differences in gene and signal regulations.
par(mfrow=c(1,2)) # option to combine plots in the graph
# perform correlation anlaysis
corData <- findCorrelation(inTadSig,plot.proportions = TRUE)
The result data.frame has a special format. It includes each signal, TAD, gene
and correlation information.
head(corData,5)
## peakid tad gene
## 1 chr15:26003055-26004347 chr15:25728907-27128907 ENSG00000114062.13
## 2 chr15:26003055-26004347 chr15:25728907-27128907 ENSG00000261529.1
## 3 chr15:26003055-26004347 chr15:25728907-27128907 ENSG00000206190.7
## 4 chr15:26003055-26004347 chr15:25728907-27128907 ENSG00000166206.9
## 5 chr15:26003055-26004347 chr15:25728907-27128907 ENSG00000235518.2
## name cor pvalue eucDist corRank
## 1 UBE3A -0.04096321 0.8458537 26.412693 8
## 2 RP13-487P22.1 0.04206502 0.8417581 6.995550 6
## 3 ATP10A 0.09644601 0.6465118 6.644602 5
## 4 GABRB3 0.02421908 0.9085147 22.698792 7
## 5 AC011196.3 0.13491773 0.5202224 7.540166 3
Further filtering of this result data can be performed by adjusting p-value and
correlation effect limits (i.e. p-val < 0.01, positive correlation only).
Visualization
The package provides post-analysis visualization function: the specific signal
and gene can be selected for correlation plot generation. Here’s example of
verified medulllobastoma Group3-specifc enhancer assoicated gene GABRA5 lying
in the same TAD as the enhancer, but not close to the gene:
# example enhancer in correlation with GABRA5
cID <- "chr15:26372163-26398073"
selCorData <- corData[corData$peakid == cID, ]
selCorData[ selCorData$name == "GABRA5", ]
## peakid tad gene name
## 230 chr15:26372163-26398073 chr15:25728907-27128907 ENSG00000186297.7 GABRA5
## cor pvalue eucDist corRank
## 230 0.878531 7.724306e-09 10.92154 1
For the plot generation it is required to provide the signal id and gene name:
plotCorrelation(inTadSig, cID, "GABRA5",
xLabel = "RPKM gene expr log2",
yLabel = "H3K27ac enrichment log2",
colByPhenotype = "Subgroup")
## ENSG00000186297.7
Note that in the visualization it’s also possible to mark the colours
representing the samples using option colByPhenotype based on the sample
annotation information included in the generation of the main object. In the
provided example medulloblastma tumour subgroups are marked.
Specific genomic region of interest can be also visualised to observe the
variance and impact of TADs using special function that works on result
data.frame obtained from function findCorrelation. The resulting plot
provides the location of signals in X-axis and genes in Y-axis. Each point
reflects the correlation stength based on p-value: -log10(P-val). This
visualization strategy was introduced in the study by S. Waszak et al, Cell,
2015
focused on investigation of chromatin architecture in human cells.
By default only detected TADs with signals inside are visualized,
but it is also possible to include all avaialble TAD regions using special
option. Here’s the example plot covering the whole chromosome 15 region used
in the test dataset:
plotCorAcrossRef(inTadSig,corData,
targetRegion = GRanges("chr15:25000000-28000000"),
tads = tadGR)
One more option of this function allows to activaite representation of postive
correlation values from 0 to 1 instead of strength.
plotCorAcrossRef(inTadSig,corData,
targetRegion = GRanges("chr15:25000000-28000000"),
showCorVals = TRUE, tads = tadGR)
It’s also possible to focus on the connections by ignoring the signal/gene
locations and focusing only on correlation values by adjusting for symmetery.
This is typical approach used for HiC contact data visualization in such
tools as for example JuiceBox. This can be activate by using the corresponding option:
plotCorAcrossRef(inTadSig,corData,
targetRegion = GRanges("chr15:25000000-28000000"),
showCorVals = TRUE, symmetric = TRUE, tads = tadGR)
These visualization strategies allow to investigate the impact of TADs.
Additional documentation is available for each function via standard R help.
Session info
Here is the output of sessionInfo()
on the system on which this
document was compiled:
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.9-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.9-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] InTAD_1.4.0 MultiAssayExperiment_1.10.0
## [3] SummarizedExperiment_1.14.0 DelayedArray_0.10.0
## [5] BiocParallel_1.18.0 matrixStats_0.54.0
## [7] Biobase_2.44.0 GenomicRanges_1.36.0
## [9] GenomeInfoDb_1.20.0 IRanges_2.18.0
## [11] S4Vectors_0.22.0 BiocGenerics_0.30.0
## [13] BiocStyle_2.12.0
##
## loaded via a namespace (and not attached):
## [1] qvalue_2.16.0 tidyselect_0.2.5
## [3] xfun_0.6 purrr_0.3.2
## [5] reshape2_1.4.3 splines_3.6.0
## [7] lattice_0.20-38 colorspace_1.4-1
## [9] htmltools_0.3.6 rtracklayer_1.44.0
## [11] yaml_2.2.0 XML_3.98-1.19
## [13] rlang_0.3.4 pillar_1.3.1
## [15] ggpubr_0.2 glue_1.3.1
## [17] GenomeInfoDbData_1.2.1 plyr_1.8.4
## [19] stringr_1.4.0 zlibbioc_1.30.0
## [21] Biostrings_2.52.0 munsell_0.5.0
## [23] gtable_0.3.0 evaluate_0.13
## [25] labeling_0.3 knitr_1.22
## [27] Rcpp_1.0.1 scales_1.0.0
## [29] BiocManager_1.30.4 XVector_0.24.0
## [31] Rsamtools_2.0.0 ggplot2_3.1.1
## [33] digest_0.6.18 stringi_1.4.3
## [35] bookdown_0.9 dplyr_0.8.0.1
## [37] grid_3.6.0 tools_3.6.0
## [39] bitops_1.0-6 magrittr_1.5
## [41] RCurl_1.95-4.12 lazyeval_0.2.2
## [43] tibble_2.1.1 crayon_1.3.4
## [45] pkgconfig_2.0.2 Matrix_1.2-17
## [47] assertthat_0.2.1 rmarkdown_1.12
## [49] R6_2.4.0 mclust_5.4.3
## [51] GenomicAlignments_1.20.0 compiler_3.6.0
References
Dali, R. and Blanchette, M., 2017. A critical assessment of topologically
associating domain prediction tools. Nucleic acids research, 45(6),
pp.2994-3005.
Dixon, J.R., Selvaraj, S., Yue, F., Kim, A., Li, Y., Shen, Y., Hu, M.,
Liu, J.S. and Ren, B., 2012. Topological domains in mammalian genomes
identified by analysis of chromatin interactions. Nature, 485(7398),
p.376.
Lin, C.Y., Erkek, S., Tong, Y., Yin, L., Federation, A.J., Zapatka, M.,
Haldipur, P., Kawauchi, D., Risch, T., Warnatz, H.J. and Worst, B.C., 2016.
Active medulloblastoma enhancers reveal subgroup-specific cellular origins.
Nature, 530(7588),
p.57.
Waszak, S.M., Delaneau, O., Gschwind, A.R., Kilpinen, H., Raghav, S.K.,
Witwicki, R.M., Orioli, A., Wiederkehr, M., Panousis, N.I., Yurovsky, A.
and Romano-Palumbo, L., 2015. Population variation and genetic control of modular chromatin architecture in humans. Cell,
162(5)