BindingSiteFinder 1.2.0
The BindingSiteFinder package is available at https://bioconductor.org and can be installed via BiocManager::install
:
if (!require("BiocManager"))
install.packages("BiocManager")
BiocManager::install("BindingSiteFinder")
Most cellular processes are regulated by RNA-binding proteins (RBPs). Knowledge on their exact positioning can be obtained from individual-nucleotide resolution UV crosslinking and immunoprecipitation (iCLIP) experiments. In a recent publication we described a complete analysis workflow to detect RBP binding sites from iCLIP data. The workflow covers all essential steps, from quality control of sequencing reads, different peak calling options, to the downstream analysis and definition of binding sites. Pre-processing and peak calling steps relies on publicly available software, whereas the definition of the final binding sites follows a custom procedure implemented in R. This vignette explains how equally sized binding sites can be defined from a genome-wide iCLIP coverage.
The workflow described herein is based on our recently published complete iCLIP analysis pipeline (Busch et al. 2020). Thus, we expect the user to have preprocessed their iCLIP sequencing reads up to the point of the peak calling step. In brief, this includes basic processing of the sequencing reads, such as quality filtering, barcode handling, mapping and the generation of a single nucleotide crosslink file for all replicates under consideration. As we describe in our manuscript replicate .bam files may or may not be merged prior to peak calling, for which we suggest PureCLIP (Krakau, Richard, and Marsico 2017). For simplicity we address only the case where peak calling was based on the merge of all replicates.
Note: If you use BindingSiteFinder in published research, please cite:
Busch, A., Brüggemann, M., Ebersberger, S., & Zarnack, K. (2020) iCLIP data analysis: A complete pipeline from sequencing reads to RBP binding sites. Methods, 178, 49-62. https://doi.org/10.1016/j.ymeth.2019.11.008
An optional step prior to the actual merging of crosslink sites into binding sites, is pre-filtering. Depending on the experiment type or sequencing depth it might be useful to retain only the most informative crosslink sites. In the case of PureCLIP called peaks are associated with a binding affinity strength score which can be used as a metric for pre-filtering.
The following example is based on four replicates of the RNA-binding protein U2AF65. Reads (as .bam files) from all four replicates were merged and subjected to PureCLIP for peak calling. The output comes in the form of .bed files which can be imported as a GRanges Object using the rtracklayer package. Here the PureCLIP socre is used to remove the 5% peaks with the lowest affinity scores.
library(rtracklayer)
csFile <- system.file("extdata", "PureCLIP_crosslink_sites_example.bed",
package="BindingSiteFinder")
cs = import(con = csFile, format = "BED")
cs
## GRanges object with 1000 ranges and 2 metadata columns:
## seqnames ranges strand | name score
## <Rle> <IRanges> <Rle> | <character> <numeric>
## [1] chr1 629906 + | 3 27.30000
## [2] chr1 629911 + | 3 1.72212
## [3] chr1 629912 + | 3 15.68900
## [4] chr1 629913 + | 3 13.82090
## [5] chr1 629915 + | 3 6.36911
## ... ... ... ... . ... ...
## [996] chr1 2396116 + | 3 7.74693
## [997] chr1 2396990 + | 3 3.92442
## [998] chr1 2396991 + | 3 14.52750
## [999] chr1 2396992 + | 3 6.59018
## [1000] chr1 2397001 + | 3 1.27159
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
library(ggplot2)
quants = quantile(cs$score, probs = seq(0,1, by = 0.05))
csFilter = cs[cs$score >= quants[2]]
ggplot(data = data.frame(score = cs$score), aes(x = log2(score))) +
geom_histogram(binwidth = 0.5) +
geom_vline(xintercept = log2(quants[2])) +
theme_classic() +
xlab("PureCLIP score (log2)") +
ylab("Count")
As input the BindingSiteFinder package expects two types of data. First, a GRanges object of all ranges that should be merged into binding sites. In our example this was the created by PureCLIP and we just have to import the file. The second information is the per replicate coverage information in form of a table, which is created below.
# Load clip signal files and define meta data object
files <- system.file("extdata", package="BindingSiteFinder")
clipFilesP <- list.files(files, pattern = "plus.bw$", full.names = TRUE)
clipFilesM <- list.files(files, pattern = "minus.bw$", full.names = TRUE)
meta = data.frame(
id = c(1,2,3,4),
condition = factor(c("WT", "WT", "KD", "KD"),
levels = c("KD", "WT")),
clPlus = clipFilesP, clMinus = clipFilesM)
meta
## id condition
## 1 1 WT
## 2 2 WT
## 3 3 KD
## 4 4 KD
## clPlus
## 1 /tmp/RtmpOPYcqb/Rinst12d02a558647b0/BindingSiteFinder/extdata/rep1_clip_plus.bw
## 2 /tmp/RtmpOPYcqb/Rinst12d02a558647b0/BindingSiteFinder/extdata/rep2_clip_plus.bw
## 3 /tmp/RtmpOPYcqb/Rinst12d02a558647b0/BindingSiteFinder/extdata/rep3_clip_plus.bw
## 4 /tmp/RtmpOPYcqb/Rinst12d02a558647b0/BindingSiteFinder/extdata/rep4_clip_plus.bw
## clMinus
## 1 /tmp/RtmpOPYcqb/Rinst12d02a558647b0/BindingSiteFinder/extdata/rep1_clip_minus.bw
## 2 /tmp/RtmpOPYcqb/Rinst12d02a558647b0/BindingSiteFinder/extdata/rep2_clip_minus.bw
## 3 /tmp/RtmpOPYcqb/Rinst12d02a558647b0/BindingSiteFinder/extdata/rep3_clip_minus.bw
## 4 /tmp/RtmpOPYcqb/Rinst12d02a558647b0/BindingSiteFinder/extdata/rep4_clip_minus.bw
This table has to have at minimum three columns, which must be named condition, clPlus and clMinus. The condition column holds information about for example treatment or KD/ KO experiments1 All four replicates are in fact of the same condition (WT), but two were assigned a different condition for demonstration purposes.. Specifying this column as a factor is important, since any downstream filtering options, which might differ between condition types, are match to the levels of the condition column. the clPlus and clMinus columns point towards the strand specific coverage for each replicate. This information will be imported as RLE objects upon object initialization. The number of ranges and crosslinks imported in the object can be shown once it has been constructed2 Here we load a previously compiled BindingSiteFinder DataSet to save disc space..
library(BindingSiteFinder)
bds = BSFDataSetFromBigWig(ranges = csFilter, meta = meta, silent = TRUE)
exampleFile <- list.files(files, pattern = ".rda$", full.names = TRUE)
load(exampleFile)
bds
## Object of class BSFDataSet
## Contained ranges: 19.740
## ----> Number of chromosomes: 68
## ----> Ranges width: 1
## Contained conditions: KD WT
Choosing how wide the final binding sites is not an easy choice and depends on various different factors. For that reason different filter options alongside with diagnostic plots were implemented to guide the decision. At first, a ratio between the crosslink events within binding sites and the crosslink events in adjacent windows to the binding sites can be computed. The higher the ratio, the better the associated binding site width captures the distribution of the underlying crosslink events. The supportRatioPlot function offers a quick screen for different width choices3 Note that given the little difference between 7nt and 9nt both options seem to be reasonable.
supportRatioPlot(bds, bsWidths = seq(from = 3, to = 19, by = 2))
To further guide the choice, a selection of potential width can be computed explicitly, here for width 3, 9, 19 and 29 while holding all other parameters constant.
bds1 <- makeBindingSites(object = bds, bsSize = 3, minWidth = 3,
minCrosslinks = 2, minClSites = 1)
bds2 <- makeBindingSites(object = bds, bsSize = 9, minWidth = 3,
minCrosslinks = 2, minClSites = 1)
bds3 <- makeBindingSites(object = bds, bsSize = 19, minWidth = 3,
minCrosslinks = 2, minClSites = 1)
bds4 <- makeBindingSites(object = bds, bsSize = 29, minWidth = 3,
minCrosslinks = 2, minClSites = 1)
l = list(`1. bsSize = 3` = bds1, `2. bsSize = 9` = bds2,
`3. bsSize = 19` = bds3, `4. bsSize = 29` = bds4)
The effect of the binding site size choice can be visualized by looking at the total iCLIP coverage. Here each plot is centered around the binding site`s midpoint and the computed width is indicated by the gray frame. In our example size = 3 appears too small, since not all of the relevant peak signal seems to be captured. On the contrary size = 29 appears extremely large. Here we decided for size = 9 because it seems to capture the central coverage peak best.
rangeCoveragePlot(l, width = 20)
Once a decision on a binding site width has been made additional filtering options can be used to force more/ less stringent binding sites. Essentially an initially computed set of binding sites is filtered by a certain parameter. The minClSites parameter for instance allows the user to define how many of the initially computed crosslink sites should fall within the range of the computed binding site.
bds1 <- makeBindingSites(object = bds, bsSize = 9, minWidth = 3,
minCrosslinks = 2, minClSites = 1)
bds2 <- makeBindingSites(object = bds, bsSize = 9, minWidth = 3,
minCrosslinks = 2, minClSites = 2)
bds3 <- makeBindingSites(object = bds, bsSize = 9, minWidth = 3,
minCrosslinks = 2, minClSites = 3)
bds4 <- makeBindingSites(object = bds, bsSize = 9, minWidth = 3,
minCrosslinks = 2, minClSites = 4)
As one might have expected the number of final binding sites that meet the threshold drops with higher constraints. The coverage on the other hand spans wider around the binding site center, resulting in a broader coverage.
l = list(`1. minClSites = 1` = bds1, `2. minClSites = 2` = bds2,
`3. minClSites = 3` = bds3, `4. minClSites = 4` = bds4)
mergeSummaryPlot(l, select = "minClSites")
l = list(`1. minClSites = 1` = bds1, `2. minClSites = 2` = bds2,
`3. minClSites = 3` = bds3, `4. minClSites = 4` = bds4)
rangeCoveragePlot(l, width = 20)
A further point to consider is the number of position within the binding site that are actually covered by iCLIP signal (crosslink events). The minCrosslinks parameter can be used to filter for that. In principal this parameter represents a ‘weaker’ version of the minClSites filter. It should be set in conjunction with the binding site width. Here for example we aim to have about one third of all position to be covered by a crosslink event.
bds10 <- makeBindingSites(object = bds, bsSize = 3, minWidth = 3,
minCrosslinks = 1, minClSites = 1)
bds20 <- makeBindingSites(object = bds, bsSize = 9, minWidth = 3,
minCrosslinks = 3, minClSites = 1)
bds30 <- makeBindingSites(object = bds, bsSize = 19, minWidth = 3,
minCrosslinks = 6, minClSites = 1)
bds40 <- makeBindingSites(object = bds, bsSize = 29, minWidth = 3,
minCrosslinks = 10, minClSites = 1)
Since the initial PureCLIP run was based on the merged summary of all four replicates, an additional reproducibility filter must be employed. Based on the merged binding sites from the previous step we can now ask which of these sites are reproducible/ supported by the individual replicates. To do this we first have do decide on a replicates specific threshold. In the present case we decided for the 5% quantile for both conditions4 A lower boundary of 1 nucleotide is set as default minimum support..
reproducibiliyCutoffPlot(bds1, max.range = 20, cutoff = c(0.05))
reproducibiliyCutoffPlot(bds1, max.range = 20, cutoff = c(0.2))
If replicates from multiple different conditions are used in the binding site definition, also condition specific thresholds can be applied. For instance the KD condition might be more error prone than the WT, therefore we define a slightly more stringent threshold5 Note that the order of how multiple thresholds are applied is defined based on the order of the factor levels of condition. to account for that fact.
reproducibiliyCutoffPlot(bds1, max.range = 20, cutoff = c(0.05, 0.1))
reproducibiliyCutoffPlot(bds1, max.range = 20, cutoff = c(0.1, 0.05))
After one has decided which quantile cutoff to use for the replicates the number of replicates that must meet the selected threshold must be specified. This includes another level of regulation and allows for some variation, since not always all replicates are forced to agree on every binding sites. For instance we could implement the rule that 3 out of 4 replicates should agree for each binding site while setting the quantile cutoff to 5% for all 4 replicates. This can be achieved with the reproducibilityFilter function. When setting the returnType = “data.frame”, the function returns a plottable object instead of a final filtered GRanges object. The overlap among the four replicates after the filtering can be easily visualized as an upset plot, using the complexHeatmaps package.
s1 = reproducibilityFilter(bds1, cutoff = c(0.05), n.reps = c(3),
returnType = "data.frame")
library(ComplexHeatmap)
m1 = make_comb_mat(s1)
UpSet(m1)
Another more complex example could make use of both conditions. For instance we could implement the rule that a binding site should pass the filter if it is supported by 2 WT replicates, while for the KD only one replicate is sufficient. As quantile cutoffs we use 5% for the WT and 10% for the KD.
s2 = reproducibilityFilter(bds1, cutoff = c(0.1, 0.05), n.reps = c(1, 2),
returnType = "data.frame")
m2 = make_comb_mat(s2)
UpSet(m2)
Once all decisions on filtering parameters have been made the reproducibilityFilter function can be run. The final binding site object can be retrieved through the getRanges successor function. If for the reproducibility filtering different thresholds were used per condition, a range is reported if any of the two conditions yielded TRUE. The filter output per condition is written to the meta columns of the GRanges object which allows additional post-processing.
bdsFinal = reproducibilityFilter(bds1, cutoff = c(0.1, 0.05), n.reps = c(1, 2))
getRanges(bdsFinal)
## GRanges object with 3522 ranges and 2 metadata columns:
## seqnames ranges strand | KD WT
## <Rle> <IRanges> <Rle> | <logical> <logical>
## 1 chr22 11630136-11630144 + | TRUE TRUE
## 2 chr22 11630411-11630419 + | TRUE TRUE
## 3 chr22 15785521-15785529 + | TRUE TRUE
## 4 chr22 15785544-15785552 + | TRUE FALSE
## 5 chr22 15785960-15785968 + | TRUE FALSE
## ... ... ... ... . ... ...
## 3290 chr22 50776977-50776985 - | TRUE FALSE
## 3290 chr22 50776990-50776998 - | TRUE TRUE
## 3291 chr22 50777819-50777827 - | TRUE FALSE
## 3292 chr22 50777867-50777875 - | TRUE TRUE
## 3293 chr22 50783296-50783304 - | TRUE TRUE
## -------
## seqinfo: 68 sequences from an unspecified genome; no seqlengths
In case the initial peak calling was performed with PureCLIP we offer the annotateWithScore function to re-annotate computed binding sites with the PureCLIP score from each crosslink site.
bdsFinal = annotateWithScore(bdsFinal, getRanges(bds))
bindingSites = getRanges(bdsFinal)
bindingSites
## GRanges object with 3522 ranges and 5 metadata columns:
## seqnames ranges strand | KD WT scoreSum
## <Rle> <IRanges> <Rle> | <logical> <logical> <numeric>
## 1 chr22 11630136-11630144 + | TRUE TRUE 595.00512
## 2 chr22 11630411-11630419 + | TRUE TRUE 121.23981
## 3 chr22 15785521-15785529 + | TRUE TRUE 24.01861
## 4 chr22 15785544-15785552 + | TRUE FALSE 30.54464
## 5 chr22 15785960-15785968 + | TRUE FALSE 6.85856
## ... ... ... ... . ... ... ...
## 3290 chr22 50776977-50776985 - | TRUE FALSE 6.38966
## 3290 chr22 50776990-50776998 - | TRUE TRUE 60.35172
## 3291 chr22 50777819-50777827 - | TRUE FALSE 9.21558
## 3292 chr22 50777867-50777875 - | TRUE TRUE 54.12220
## 3293 chr22 50783296-50783304 - | TRUE TRUE 39.85535
## scoreMean scoreMax
## <numeric> <numeric>
## 1 148.75128 334.18700
## 2 30.30995 77.58330
## 3 4.80372 9.21879
## 4 10.18155 15.20870
## 5 3.42928 5.67521
## ... ... ...
## 3290 3.19483 3.19483
## 3290 20.11724 37.55070
## 3291 3.07186 5.38047
## 3292 18.04073 22.78030
## 3293 13.28512 16.22960
## -------
## seqinfo: 68 sequences from an unspecified genome; no seqlengths
After the definition of binding sites one is usually interested in the binding profile of the RBP. General question are: Which gene type does my RBP bind? If it binds to protein coding genes, to which part of the transcripts does it bind to? Answering these types of questions involves using an additional gene annotation resource to overlap the binding sites with. In the following we demonstrate how this could be achieved using GenomicRanges objects.
As gene annotation any desired source can be used, as long as it can be converted into a TxDb object. In the present example we use gencode gene annotations (hg38) from chromosome 22 imported from GFF3 format. We extract the ranges from the TxDb object and match them with additional information from the GFF3 file6 Note that we prepared this file beforehand and only load the final object (see /inst/scripts/PrepareExampleData.R).
library(GenomicFeatures)
# Make annotation database from gff3 file
annoFile = "./gencode.v37.annotation.chr22.header.gff3"
annoDb = makeTxDbFromGFF(file = annoFile, format = "gff3")
annoInfo = import(annoFile, format = "gff3")
# Get genes as GRanges
gns = genes(annoDb)
idx = match(gns$gene_id, annoInfo$gene_id)
elementMetadata(gns) = cbind(elementMetadata(gns),
elementMetadata(annoInfo)[idx,])
# Load GRanges with genes
geneFile <- list.files(files, pattern = "gns.rds$", full.names = TRUE)
load(geneFile)
gns
## GRanges object with 1387 ranges and 3 metadata columns:
## seqnames ranges strand | gene_id
## <Rle> <IRanges> <Rle> | <character>
## ENSG00000008735.14 chr22 50600793-50613981 + | ENSG00000008735.14
## ENSG00000015475.19 chr22 17734138-17774770 - | ENSG00000015475.19
## ENSG00000025708.14 chr22 50525752-50530032 - | ENSG00000025708.14
## ENSG00000025770.19 chr22 50508224-50524780 + | ENSG00000025770.19
## ENSG00000040608.14 chr22 20241415-20283246 - | ENSG00000040608.14
## ... ... ... ... . ...
## ENSG00000288106.1 chr22 38886837-38903794 + | ENSG00000288106.1
## ENSG00000288153.1 chr22 30664961-30674134 + | ENSG00000288153.1
## ENSG00000288262.1 chr22 11474744-11479643 + | ENSG00000288262.1
## ENSG00000288540.1 chr22 48320412-48333199 - | ENSG00000288540.1
## ENSG00000288683.1 chr22 18077989-18131720 + | ENSG00000288683.1
## gene_type gene_name
## <character> <character>
## ENSG00000008735.14 protein_coding MAPK8IP2
## ENSG00000015475.19 protein_coding BID
## ENSG00000025708.14 protein_coding TYMP
## ENSG00000025770.19 protein_coding NCAPH2
## ENSG00000040608.14 protein_coding RTN4R
## ... ... ...
## ENSG00000288106.1 lncRNA AL022318.5
## ENSG00000288153.1 unprocessed_pseudogene AC003072.2
## ENSG00000288262.1 unprocessed_pseudogene CT867976.1
## ENSG00000288540.1 lncRNA AL008720.2
## ENSG00000288683.1 protein_coding AC016027.6
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Depending on the annotation source and organism type gene annotation overlap each other to some degree. This complicates assiging binding sites to their host genes.
# count the number of annotation overlaps
mcols(bindingSites)$geneOl = factor(countOverlaps(bindingSites, gns))
df = as.data.frame(mcols(bindingSites))
#
ggplot(df, aes(x = geneOl)) +
geom_bar() +
scale_y_log10() +
geom_text(stat='count', aes(label=..count..), vjust=-0.3) +
labs(
title = "Overlapping gene annotation problem",
caption = "Bar plot that show how many times a binding site
overlaps with an annotation of a different gene.",
x = "Number of overlaps",
y = "Count (log10)"
) + theme_bw()
In the present case most overlaps seemed to be cause by lncRNA annotations probably residing within introns of protein coding genes. Since we are interested in the binding of our RBP on protein coding genes, we simply remove all binding sites on spurious annotations[^7].
[^7] Note: This is a simplified scheme and might be adapted for more complex research questions.
# show which gene types cause the most annoation overlaps. Split binding sites
# by the number of overlaps and remove those binding sites that do not overlap
# with any annotation (intergenic)
library(forcats)
splitSites = split(bindingSites, bindingSites$geneOl)
df = as(lapply(splitSites, function(x){subsetByOverlaps(gns,x)}), "GRangesList")
df = df[2:length(df)]
df = lapply(1:length(df), function(x) {
data.frame(olClass = names(df[x]),
geneType = factor(df[[x]]$gene_type))})
df = do.call(rbind,df)
df = df %>% mutate(geneType = fct_lump_n(geneType, 4))
#
ggplot(df, aes(x = olClass, fill = geneType)) +
geom_bar(position = "fill") +
scale_fill_brewer(palette = "Set2") +
scale_y_continuous(labels = scales::percent) +
labs(
title = "Overlapping gene annotation causes",
caption = "Percentage bar plot that show the fraction
for each gene type and overlap number.",
fill = "Gene type",
x = "Number of overlaps",
y = "Percent"
) + theme_bw()
# Remove all binding sites that overlap multiple gene annotations and retain
# only protein coding genes and binding sites.
bindingSites = subset(bindingSites, geneOl == 1)
targetGenes = subsetByOverlaps(gns, bindingSites)
targetGenes = subset(targetGenes, gene_type == "protein_coding")
Now that we know our RBP targets mainly protein coding genes, we will have a closer look at positioning within transcript regions of protein coding genes. Again we use a TxDB compiled from gencode annotations to extract the relevant genomic regions7 Note that we prepared this file beforehand and only load the final object (see /inst/scripts/PrepareExampleData.R).
# Count the overlaps of each binding site fore each part of the transcript.
cdseq = cds(annoDb)
intrns = unlist(intronsByTranscript(annoDb))
utrs3 = unlist(threeUTRsByTranscript(annoDb))
utrs5 = unlist(fiveUTRsByTranscript(annoDb))
regions = list(CDS = cdseq, Intron = intrns, UTR3 = utrs3, UTR5 = utrs5)
# Load list with transcript regions
regionFile <- list.files(files, pattern = "regions.rds$", full.names = TRUE)
load(regionFile)
As one would expect a lot of multiple different transcript type annotations overlap with single binding sites. These numbers usually rise with more complex annotation sources used. Here most binding sites (3.716) overlap with a single distinct transcript region already, but a substantially amount does not.
# Count the overlaps of each binding site fore each part of the transcript.
cdseq = regions$CDS %>% countOverlaps(bindingSites,.)
intrns = regions$Intron %>% countOverlaps(bindingSites,.)
utrs3 = regions$UTR3 %>% countOverlaps(bindingSites,.)
utrs5 = regions$UTR5 %>% countOverlaps(bindingSites,.)
countDf = data.frame(CDS = cdseq, Intron = intrns, UTR3 = utrs3, UTR5 = utrs5)
# Counting how many times an annotation is not present.
df = data.frame(olClass = apply(countDf,1,function(x) length(x[x == 0])))
df = data.frame(type = rev(names(table(df))), count = as.vector(table(df)))
ggplot(df, aes(x = type, y = count)) +
geom_col() +
scale_y_log10() +
geom_text(aes(label = count), vjust=-0.3) +
labs(
title = "Overlapping transcript annotation crashes",
caption = "Bar plot that show how many times a binding site
overlaps with an annotation of a different transcript part",
x = "Number of overlaps",
y = "Count (log10)"
) + theme_bw()
Before assigning binding sites to distinct transcript regions, it is worth to look at the unresolved profile8 Note: Binding sites are essentially counted multiple times here.. Here we observe a predominance for binding sites within intronic regions.
# Count the number of binding sites within each transcript region, ignoring the
# problem of overlapping annotations (counting binding sites multiple times).
plotCountDf = countDf
plotCountDf[plotCountDf > 1] = 1
df = plotCountDf %>%
pivot_longer(everything()) %>%
group_by(name) %>%
summarize(count = sum(value), .groups = "drop") %>%
mutate(name = as.factor(name)) %>%
mutate(name = fct_relevel(name, "Intron", "UTR3", "CDS", "UTR5"))
ggplot(df, aes(x = name, y = count)) +
geom_col() +
scale_y_log10() +
xlab("Number of overlaps") +
ylab("Count") +
geom_text(aes(label = count), vjust=-0.3) +
labs(
title = "Binding sites per genomic region",
caption = "Bar plot that shows the number of
binding sites per genomic region.",
x = "Genomic region",
y = "Count (log10)"
) + theme_bw()
The complexity of the problem can also be showcased using an UpSet plot representation. From this plot it becomes very clear that most binding sites reside within introns and that the most annotation conflicts are between introns and 3’UTRs.
# Show how many annotation overlaps are caused by what transcript types.
m = make_comb_mat(plotCountDf)
UpSet(m)
Resolving these types of conflicts is not straight forward and always involves setting up rules that define when to choose which annotation over the other. In our case we went for a majority vote based system, while resolving ties in the vote based on a hierarchical rule prefering introns over 3’UTRs over CDS over 5’UTRs9 Note: In rear cases additional intergenic binding sites can be found even though we removed these in a previous step. This is because some binding sites overlap with the annotation of a gene, but not with any of the transcript isoforms associated with the gene..
# Solve the overlapping transcript types problem with majority votes and
# hierarchical rule.
rule = c("Intron", "UTR3", "CDS", "UTR5")
# Prepare dataframe for counting (reorder by rule, add intergenic counts)
countDf = countDf[, rule] %>%
as.matrix() %>%
cbind.data.frame(., Intergenic = ifelse(rowSums(countDf) == 0, 1, 0) )
# Apply majority vote scheme
regionName = colnames(countDf)
region = apply(countDf, 1, function(x){regionName[which.max(x)]})
mcols(bindingSites)$region = region
df = data.frame(Region = region)
#
ggplot(df, aes(x = fct_infreq(Region))) +
geom_bar() +
scale_y_log10() +
geom_text(stat='count', aes(label=..count..), vjust=-0.3) +
labs(
title = "Binding sites per genomic region",
caption = "Bar plot that shows the number of binding sites per
genomic region.",
x = "Genomic region",
y = "Count (log10)"
) + theme_bw()
For a splicing regulator protein such as U2AF65 binding to introns is expected. However, due to the sheer size difference in introns one might also expect to observe more binding in introns just by chance. For that reason one could divide the number of binding sites by the summed up region length to compute a relative enrichment score10 Note: Such a relative enrichment heavily favours smaller regions such as 5’UTRs. Therefore absolute numbers and relative enrichment should both be considered..
# Sum up the length of each transcript region with overlapping binding sites.
cdsLen = regions$CDS %>%
subsetByOverlaps(., bindingSites) %>%
width() %>%
sum()
intrnsLen = regions$Intron %>%
subsetByOverlaps(., bindingSites) %>%
width() %>%
sum()
utrs3Len = regions$UTR3 %>%
subsetByOverlaps(., bindingSites) %>%
width() %>%
sum()
utrs5Len = regions$UTR5 %>%
subsetByOverlaps(., bindingSites) %>%
width() %>%
sum()
lenSum = c(cdsLen, intrnsLen, utrs3Len, utrs5Len)
# Compute the relative enrichment per transcript region.
bindingSites = subset(bindingSites, region != "Intergenic")
df = data.frame(region = names(table(bindingSites$region)),
count = as.vector(table(bindingSites$region)))
df$regionLength = lenSum
df$normalizedCount = df$count / df$regionLength
#
ggplot(df, aes(x = region, y = normalizedCount)) +
geom_col(width = 0.5) +
labs(
title = "Relative enrichment per genomic region ",
caption = "Bar plot of region count, divided by the summed length.",
x = "Genomic region",
y = "Relative enrichment",
fill = "Genomic region"
) + theme_bw()
sessionInfo()
## R version 4.2.0 RC (2022-04-19 r82224)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.15-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.15-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] grid stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] dplyr_1.0.8 forcats_0.5.1
## [3] BindingSiteFinder_1.2.0 ComplexHeatmap_2.12.0
## [5] tidyr_1.2.0 ggplot2_3.3.5
## [7] rtracklayer_1.56.0 GenomicAlignments_1.32.0
## [9] Rsamtools_2.12.0 Biostrings_2.64.0
## [11] XVector_0.36.0 SummarizedExperiment_1.26.0
## [13] Biobase_2.56.0 MatrixGenerics_1.8.0
## [15] matrixStats_0.62.0 GenomicRanges_1.48.0
## [17] GenomeInfoDb_1.32.0 IRanges_2.30.0
## [19] S4Vectors_0.34.0 BiocGenerics_0.42.0
## [21] BiocStyle_2.24.0
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-7 doParallel_1.0.17 RColorBrewer_1.1-3
## [4] tools_4.2.0 bslib_0.3.1 utf8_1.2.2
## [7] R6_2.5.1 DBI_1.1.2 colorspace_2.0-3
## [10] GetoptLong_1.0.5 withr_2.5.0 tidyselect_1.1.2
## [13] compiler_4.2.0 cli_3.3.0 Cairo_1.5-15
## [16] DelayedArray_0.22.0 labeling_0.4.2 bookdown_0.26
## [19] sass_0.4.1 scales_1.2.0 stringr_1.4.0
## [22] digest_0.6.29 rmarkdown_2.14 pkgconfig_2.0.3
## [25] htmltools_0.5.2 highr_0.9 fastmap_1.1.0
## [28] rlang_1.0.2 GlobalOptions_0.1.2 farver_2.1.0
## [31] shape_1.4.6 jquerylib_0.1.4 BiocIO_1.6.0
## [34] generics_0.1.2 jsonlite_1.8.0 BiocParallel_1.30.0
## [37] RCurl_1.98-1.6 magrittr_2.0.3 GenomeInfoDbData_1.2.8
## [40] Matrix_1.4-1 Rcpp_1.0.8.3 munsell_0.5.0
## [43] fansi_1.0.3 lifecycle_1.0.1 stringi_1.7.6
## [46] yaml_2.3.5 MASS_7.3-57 zlibbioc_1.42.0
## [49] plyr_1.8.7 parallel_4.2.0 crayon_1.5.1
## [52] lattice_0.20-45 circlize_0.4.14 magick_2.7.3
## [55] knitr_1.38 pillar_1.7.0 rjson_0.2.21
## [58] codetools_0.2-18 XML_3.99-0.9 glue_1.6.2
## [61] evaluate_0.15 BiocManager_1.30.17 png_0.1-7
## [64] vctrs_0.4.1 tweenr_1.0.2 foreach_1.5.2
## [67] gtable_0.3.0 purrr_0.3.4 polyclip_1.10-0
## [70] clue_0.3-60 assertthat_0.2.1 xfun_0.30
## [73] ggforce_0.3.3 restfulr_0.0.13 tibble_3.1.6
## [76] iterators_1.0.14 cluster_2.1.3 ellipsis_0.3.2
Busch, Anke, Mirko Brüggemann, Stefanie Ebersberger, and Kathi Zarnack. 2020. “ICLIP Data Analysis: A Complete Pipeline from Sequencing Reads to Rbp Binding Sites.” Methods 178 (June): 49–62. https://doi.org/10.1016/j.ymeth.2019.11.008.
Krakau, Sabrina, Hugues Richard, and Annalisa Marsico. 2017. “PureCLIP: Capturing Target-Specific ProteinRNA Interaction Footprints from Single-Nucleotide Clip-Seq Data.” Genome Biology 18 (1). https://doi.org/10.1186/s13059-017-1364-2.