1 Installation

The BindingSiteFinder package is available at https://bioconductor.org and can be installed via BiocManager::install:

if (!require("BiocManager"))
    install.packages("BiocManager")
BiocManager::install("BindingSiteFinder")

2 Motivation

Most cellular processes are regulated by RNA-binding proteins (RBPs). Knowledge on their exact positioning can be obtained from individual-nucleotide resolution UV crosslinking and immunoprecipitation (iCLIP) experiments. In a recent publication we described a complete analysis workflow to detect RBP binding sites from iCLIP data. The workflow covers all essential steps, from quality control of sequencing reads, different peak calling options, to the downstream analysis and definition of binding sites. Pre-processing and peak calling steps relies on publicly available software, whereas the definition of the final binding sites follows a custom procedure implemented in R. This vignette explains how equally sized binding sites can be defined from a genome-wide iCLIP coverage.

3 Pre-requisite

The workflow described herein is based on our recently published complete iCLIP analysis pipeline (Busch et al. 2020). Thus, we expect the user to have preprocessed their iCLIP sequencing reads up to the point of the peak calling step. In brief, this includes basic processing of the sequencing reads, such as quality filtering, barcode handling, mapping and the generation of a single nucleotide crosslink file for all replicates under consideration. As we describe in our manuscript replicate .bam files may or may not be merged prior to peak calling, for which we suggest PureCLIP (Krakau, Richard, and Marsico 2017). For simplicity we address only the case where peak calling was based on the merge of all replicates.

Overview of the preprocessing workflow.

Note: If you use BindingSiteFinder in published research, please cite:

Busch, A., Brüggemann, M., Ebersberger, S., & Zarnack, K. (2020) iCLIP data analysis: A complete pipeline from sequencing reads to RBP binding sites. Methods, 178, 49-62. https://doi.org/10.1016/j.ymeth.2019.11.008

4 Pre-filtering of crosslink sites

An optional step prior to the actual merging of crosslink sites into binding sites, is pre-filtering. Depending on the experiment type or sequencing depth it might be useful to retain only the most informative crosslink sites. In the case of PureCLIP called peaks are associated with a binding affinity strength score which can be used as a metric for pre-filtering.

The following example is based on four replicates of the RNA-binding protein U2AF65. Reads (as .bam files) from all four replicates were merged and subjected to PureCLIP for peak calling. The output comes in the form of .bed files which can be imported as a GRanges Object using the rtracklayer package. Here the PureCLIP socre is used to remove the 5% peaks with the lowest affinity scores.

library(rtracklayer)
csFile <- system.file("extdata", "PureCLIP_crosslink_sites_example.bed", 
                      package="BindingSiteFinder")
cs = import(con = csFile, format = "BED")
cs

## GRanges object with 1000 ranges and 2 metadata columns:
##          seqnames    ranges strand |        name     score
##             <Rle> <IRanges>  <Rle> | <character> <numeric>
##      [1]     chr1    629906      + |           3  27.30000
##      [2]     chr1    629911      + |           3   1.72212
##      [3]     chr1    629912      + |           3  15.68900
##      [4]     chr1    629913      + |           3  13.82090
##      [5]     chr1    629915      + |           3   6.36911
##      ...      ...       ...    ... .         ...       ...
##    [996]     chr1   2396116      + |           3   7.74693
##    [997]     chr1   2396990      + |           3   3.92442
##    [998]     chr1   2396991      + |           3  14.52750
##    [999]     chr1   2396992      + |           3   6.59018
##   [1000]     chr1   2397001      + |           3   1.27159
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

library(ggplot2)
quants = quantile(cs$score, probs = seq(0,1, by = 0.05))
csFilter = cs[cs$score >= quants[2]]

ggplot(data = data.frame(score = cs$score), aes(x = log2(score))) +
    geom_histogram(binwidth = 0.5) +
    geom_vline(xintercept = log2(quants[2])) +
    theme_classic() +
    xlab("PureCLIP score (log2)") +
    ylab("Count")

5 Merge Peaks into binding sites

5.1 Construction of the the BindingSiteFinder dataset

As input the BindingSiteFinder package expects two types of data. First, a GRanges object of all ranges that should be merged into binding sites. In our example this was the created by PureCLIP and we just have to import the file. The second information is the per replicate coverage information in form of a table, which is created below.

# Load clip signal files and define meta data object
files <- system.file("extdata", package="BindingSiteFinder")
clipFilesP <- list.files(files, pattern = "plus.bw$", full.names = TRUE)
clipFilesM <- list.files(files, pattern = "minus.bw$", full.names = TRUE)

meta = data.frame(
  id = c(1,2,3,4),
  condition = factor(c("WT", "WT", "KD", "KD"), 
                     levels = c("KD", "WT")), 
  clPlus = clipFilesP, clMinus = clipFilesM)
meta

##   id condition
## 1  1        WT
## 2  2        WT
## 3  3        KD
## 4  4        KD
##                                                                           clPlus
## 1 /tmp/RtmpviJCJR/Rinst1d038966ce878/BindingSiteFinder/extdata/rep1_clip_plus.bw
## 2 /tmp/RtmpviJCJR/Rinst1d038966ce878/BindingSiteFinder/extdata/rep2_clip_plus.bw
## 3 /tmp/RtmpviJCJR/Rinst1d038966ce878/BindingSiteFinder/extdata/rep3_clip_plus.bw
## 4 /tmp/RtmpviJCJR/Rinst1d038966ce878/BindingSiteFinder/extdata/rep4_clip_plus.bw
##                                                                           clMinus
## 1 /tmp/RtmpviJCJR/Rinst1d038966ce878/BindingSiteFinder/extdata/rep1_clip_minus.bw
## 2 /tmp/RtmpviJCJR/Rinst1d038966ce878/BindingSiteFinder/extdata/rep2_clip_minus.bw
## 3 /tmp/RtmpviJCJR/Rinst1d038966ce878/BindingSiteFinder/extdata/rep3_clip_minus.bw
## 4 /tmp/RtmpviJCJR/Rinst1d038966ce878/BindingSiteFinder/extdata/rep4_clip_minus.bw

This table has to have at minimum three columns, which must be named condition, clPlus and clMinus. The condition column holds information about for example treatment or KD/ KO experiments111 All four replicates are in fact of the same condition (WT), but two were assigned a different condition for demonstration purposes.. Specifying this column as a factor is important, since any downstream filtering options, which might differ between condition types, are match to the levels of the condition column. the clPlus and clMinus columns point towards the strand specific coverage for each replicate. This information will be imported as RLE objects upon object initialization. The number of ranges and crosslinks imported in the object can be shown once it has been constructed222 Here we load a previously compiled BindingSiteFinder DataSet to save disc space..

library(BindingSiteFinder)
bds = BSFDataSetFromBigWig(ranges = csFilter, meta = meta, silent = TRUE)

exampleFile <- list.files(files, pattern = ".rda$", full.names = TRUE)
load(exampleFile)
bds

## Object of class BSFDataSet 
## Contained ranges:  19.740 
## ----> Number of chromosomes:  68 
## ----> Ranges width:  1 
## Contained conditions:  KD WT

5.2 Choosing the binding site width

Choosing how wide the final binding sites is not an easy choice and depends on various different factors. For that reason different filter options alongside with diagnostic plots were implemented to guide the decision. At first, a ratio between the crosslink events within binding sites and the crosslink events in adjacent windows to the binding sites can be computed. The higher the ratio, the better the associated binding site width captures the distribution of the underlying crosslink events. The supportRatioPlot function offers a quick screen for different width choices333 Note that given the little difference between 7nt and 9nt both options seem to be reasonable.

supportRatioPlot(bds, bsWidths = seq(from = 3, to = 19, by = 2))

To further guide the choice, a selection of potential width can be computed explicitly, here for width 3, 9, 19 and 29 while holding all other parameters constant.

bds1 <- makeBindingSites(object = bds, bsSize = 3, minWidth = 3,
                         minCrosslinks = 2, minClSites = 1)

bds2 <- makeBindingSites(object = bds, bsSize = 9, minWidth = 3,
                         minCrosslinks = 2, minClSites = 1)

bds3 <- makeBindingSites(object = bds, bsSize = 19, minWidth = 3,
                         minCrosslinks = 2, minClSites = 1)

bds4 <- makeBindingSites(object = bds, bsSize = 29, minWidth = 3,
                         minCrosslinks = 2, minClSites = 1)
l = list(`1. bsSize = 3` = bds1, `2. bsSize = 9` = bds2, 
         `3. bsSize = 19` = bds3, `4. bsSize = 29` = bds4)

The effect of the binding site size choice can be visualized by looking at the total iCLIP coverage. Here each plot is centered around the binding site`s midpoint and the computed width is indicated by the gray frame. In our example size = 3 appears too small, since not all of the relevant peak signal seems to be captured. On the contrary size = 29 appears extremely large. Here we decided for size = 9 because it seems to capture the central coverage peak best.

rangeCoveragePlot(l, width = 20)

5.3 Applying additional constraints

Once a decision on a binding site width has been made additional filtering options can be used to force more/ less stringent binding sites. Essentially an initially computed set of binding sites is filtered by a certain parameter. The minClSites parameter for instance allows the user to define how many of the initially computed crosslink sites should fall within the range of the computed binding site.

bds1 <- makeBindingSites(object = bds, bsSize = 9, minWidth = 3,
                         minCrosslinks = 2, minClSites = 1)

bds2 <- makeBindingSites(object = bds, bsSize = 9, minWidth = 3,
                         minCrosslinks = 2, minClSites = 2)

bds3 <- makeBindingSites(object = bds, bsSize = 9, minWidth = 3,
                         minCrosslinks = 2, minClSites = 3)

bds4 <- makeBindingSites(object = bds, bsSize = 9, minWidth = 3,
                         minCrosslinks = 2, minClSites = 4)

As one might have expected the number of final binding sites that meet the threshold drops with higher constraints. The coverage on the other hand spans wider around the binding site center, resulting in a broader coverage.

l = list(`1. minClSites = 1` = bds1, `2. minClSites = 2` = bds2, 
         `3. minClSites = 3` = bds3, `4. minClSites = 4` = bds4)
mergeSummaryPlot(l, select = "minClSites")

l = list(`1. minClSites = 1` = bds1, `2. minClSites = 2` = bds2, 
         `3. minClSites = 3` = bds3, `4. minClSites = 4` = bds4)
rangeCoveragePlot(l, width = 20)

A further point to consider is the number of position within the binding site that are actually covered by iCLIP signal (crosslink events). The minCrosslinks parameter can be used to filter for that. In principal this parameter represents a ‘weaker’ version of the minClSites filter. It should be set in conjunction with the binding site width. Here for example we aim to have about one third of all position to be covered by a crosslink event.

bds10 <- makeBindingSites(object = bds, bsSize = 3, minWidth = 3,
                         minCrosslinks = 1, minClSites = 1)

bds20 <- makeBindingSites(object = bds, bsSize = 9, minWidth = 3,
                         minCrosslinks = 3, minClSites = 1)

bds30 <- makeBindingSites(object = bds, bsSize = 19, minWidth = 3,
                         minCrosslinks = 6, minClSites = 1)

bds40 <- makeBindingSites(object = bds, bsSize = 29, minWidth = 3,
                         minCrosslinks = 10, minClSites = 1)

6 Reproducibility filter

6.1 Choosing a replicate specific cutoff

Since the initial PureCLIP run was based on the merged summary of all four replicates, an additional reproducibility filter must be employed. Based on the merged binding sites from the previous step we can now ask which of these sites are reproducible/ supported by the individual replicates. To do this we first have do decide on a replicates specific threshold. In the present case we decided for the 5% quantile for both conditions444 A lower boundary of 1 nucleotide is set as default minimum support..

reproducibiliyCutoffPlot(bds1, max.range = 20, cutoff = c(0.05))

reproducibiliyCutoffPlot(bds1, max.range = 20, cutoff = c(0.2))

If replicates from multiple different conditions are used in the binding site definition, also condition specific thresholds can be applied. For instance the KD condition might be more error prone than the WT, therefore we define a slightly more stringent threshold555 Note that the order of how multiple thresholds are applied is defined based on the order of the factor levels of condition. to account for that fact.

reproducibiliyCutoffPlot(bds1, max.range = 20, cutoff = c(0.05, 0.1))

reproducibiliyCutoffPlot(bds1, max.range = 20, cutoff = c(0.1, 0.05))

6.2 Defining the overall support level

After one has decided which quantile cutoff to use for the replicates the number of replicates that must meet the selected threshold must be specified. This includes another level of regulation and allows for some variation, since not always all replicates are forced to agree on every binding sites. For instance we could implement the rule that 3 out of 4 replicates should agree for each binding site while setting the quantile cutoff to 5% for all 4 replicates. This can be achieved with the reproducibilityFilter function. When setting the returnType = “data.frame”, the function returns a plottable object instead of a final filtered GRanges object. The overlap among the four replicates after the filtering can be easily visualized as an upset plot, using the complexHeatmaps package.

s1 = reproducibilityFilter(bds1, cutoff = c(0.05), n.reps = c(3), 
                           returnType = "data.frame")
library(ComplexHeatmap)
m1 = make_comb_mat(s1)
UpSet(m1)

Another more complex example could make use of both conditions. For instance we could implement the rule that a binding site should pass the filter if it is supported by 2 WT replicates, while for the KD only one replicate is sufficient. As quantile cutoffs we use 5% for the WT and 10% for the KD.

s2 = reproducibilityFilter(bds1, cutoff = c(0.1, 0.05), n.reps = c(1, 2),
                           returnType = "data.frame")
m2 = make_comb_mat(s2)
UpSet(m2)

Once all decisions on filtering parameters have been made the reproducibilityFilter function can be run. The final binding site object can be retrieved through the getRanges successor function. If for the reproducibility filtering different thresholds were used per condition, a range is reported if any of the two conditions yielded TRUE. The filter output per condition is written to the meta columns of the GRanges object which allows additional post-processing.

bdsFinal = reproducibilityFilter(bds1, cutoff = c(0.1, 0.05), n.reps = c(1, 2))
getRanges(bdsFinal)

## GRanges object with 3522 ranges and 2 metadata columns:
##        seqnames            ranges strand |        KD        WT
##           <Rle>         <IRanges>  <Rle> | <logical> <logical>
##      1    chr22 11630136-11630144      + |      TRUE      TRUE
##      2    chr22 11630411-11630419      + |      TRUE      TRUE
##      3    chr22 15785521-15785529      + |      TRUE      TRUE
##      4    chr22 15785544-15785552      + |      TRUE     FALSE
##      5    chr22 15785960-15785968      + |      TRUE     FALSE
##    ...      ...               ...    ... .       ...       ...
##   3290    chr22 50776977-50776985      - |      TRUE     FALSE
##   3290    chr22 50776990-50776998      - |      TRUE      TRUE
##   3291    chr22 50777819-50777827      - |      TRUE     FALSE
##   3292    chr22 50777867-50777875      - |      TRUE      TRUE
##   3293    chr22 50783296-50783304      - |      TRUE      TRUE
##   -------
##   seqinfo: 68 sequences from an unspecified genome; no seqlengths

In case the initial peak calling was performed with PureCLIP we offer the annotateWithScore function to re-annotate computed binding sites with the PureCLIP score from each crosslink site.

bdsFinal = annotateWithScore(bdsFinal, getRanges(bds))
bindingSites = getRanges(bdsFinal)
bindingSites

## GRanges object with 3522 ranges and 5 metadata columns:
##        seqnames            ranges strand |        KD        WT  scoreSum
##           <Rle>         <IRanges>  <Rle> | <logical> <logical> <numeric>
##      1    chr22 11630136-11630144      + |      TRUE      TRUE 595.00512
##      2    chr22 11630411-11630419      + |      TRUE      TRUE 121.23981
##      3    chr22 15785521-15785529      + |      TRUE      TRUE  24.01861
##      4    chr22 15785544-15785552      + |      TRUE     FALSE  30.54464
##      5    chr22 15785960-15785968      + |      TRUE     FALSE   6.85856
##    ...      ...               ...    ... .       ...       ...       ...
##   3290    chr22 50776977-50776985      - |      TRUE     FALSE   6.38966
##   3290    chr22 50776990-50776998      - |      TRUE      TRUE  60.35172
##   3291    chr22 50777819-50777827      - |      TRUE     FALSE   9.21558
##   3292    chr22 50777867-50777875      - |      TRUE      TRUE  54.12220
##   3293    chr22 50783296-50783304      - |      TRUE      TRUE  39.85535
##        scoreMean  scoreMax
##        <numeric> <numeric>
##      1 148.75128 334.18700
##      2  30.30995  77.58330
##      3   4.80372   9.21879
##      4  10.18155  15.20870
##      5   3.42928   5.67521
##    ...       ...       ...
##   3290   3.19483   3.19483
##   3290  20.11724  37.55070
##   3291   3.07186   5.38047
##   3292  18.04073  22.78030
##   3293  13.28512  16.22960
##   -------
##   seqinfo: 68 sequences from an unspecified genome; no seqlengths

7 Definition of the binding site profile

After the definition of binding sites one is usually interested in the binding profile of the RBP. General question are: Which gene type does my RBP bind? If it binds to protein coding genes, to which part of the transcripts does it bind to? Answering these types of questions involves using an additional gene annotation resource to overlap the binding sites with. In the following we demonstrate how this could be achieved using GenomicRanges objects.

7.1 Target gene profile

As gene annotation any desired source can be used, as long as it can be converted into a TxDb object. In the present example we use gencode gene annotations (hg38) from chromosome 22 imported from GFF3 format. We extract the ranges from the TxDb object and match them with additional information from the GFF3 file666 Note that we prepared this file beforehand and only load the final object (see /inst/scripts/PrepareExampleData.R).

library(GenomicFeatures)
# Make annotation database from gff3 file
annoFile = "./gencode.v37.annotation.chr22.header.gff3"
annoDb = makeTxDbFromGFF(file = annoFile, format = "gff3")
annoInfo = import(annoFile, format = "gff3")
# Get genes as GRanges
gns = genes(annoDb)
idx = match(gns$gene_id, annoInfo$gene_id)
elementMetadata(gns) = cbind(elementMetadata(gns),
                             elementMetadata(annoInfo)[idx,])

# Load GRanges with genes
geneFile <- list.files(files, pattern = "gns.rds$", full.names = TRUE)
load(geneFile)
gns

## GRanges object with 1387 ranges and 3 metadata columns:
##                      seqnames            ranges strand |            gene_id
##                         <Rle>         <IRanges>  <Rle> |        <character>
##   ENSG00000008735.14    chr22 50600793-50613981      + | ENSG00000008735.14
##   ENSG00000015475.19    chr22 17734138-17774770      - | ENSG00000015475.19
##   ENSG00000025708.14    chr22 50525752-50530032      - | ENSG00000025708.14
##   ENSG00000025770.19    chr22 50508224-50524780      + | ENSG00000025770.19
##   ENSG00000040608.14    chr22 20241415-20283246      - | ENSG00000040608.14
##                  ...      ...               ...    ... .                ...
##    ENSG00000288106.1    chr22 38886837-38903794      + |  ENSG00000288106.1
##    ENSG00000288153.1    chr22 30664961-30674134      + |  ENSG00000288153.1
##    ENSG00000288262.1    chr22 11474744-11479643      + |  ENSG00000288262.1
##    ENSG00000288540.1    chr22 48320412-48333199      - |  ENSG00000288540.1
##    ENSG00000288683.1    chr22 18077989-18131720      + |  ENSG00000288683.1
##                                   gene_type   gene_name
##                                 <character> <character>
##   ENSG00000008735.14         protein_coding    MAPK8IP2
##   ENSG00000015475.19         protein_coding         BID
##   ENSG00000025708.14         protein_coding        TYMP
##   ENSG00000025770.19         protein_coding      NCAPH2
##   ENSG00000040608.14         protein_coding       RTN4R
##                  ...                    ...         ...
##    ENSG00000288106.1                 lncRNA  AL022318.5
##    ENSG00000288153.1 unprocessed_pseudogene  AC003072.2
##    ENSG00000288262.1 unprocessed_pseudogene  CT867976.1
##    ENSG00000288540.1                 lncRNA  AL008720.2
##    ENSG00000288683.1         protein_coding  AC016027.6
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Depending on the annotation source and organism type gene annotation overlap each other to some degree. This complicates assiging binding sites to their host genes.

# count the number of annotation overlaps
mcols(bindingSites)$geneOl = factor(countOverlaps(bindingSites, gns))
df = as.data.frame(mcols(bindingSites))
# 
ggplot(df, aes(x = geneOl)) +
    geom_bar() +
    scale_y_log10() +
    geom_text(stat='count', aes(label=..count..), vjust=-0.3) +
    labs(
        title = "Overlapping gene annotation problem",
        caption = "Bar plot that show how many times a binding site
    overlaps with an annotation of a different gene.",
    x = "Number of overlaps", 
    y = "Count (log10)"
    ) + theme_bw()

In the present case most overlaps seemed to be cause by lncRNA annotations probably residing within introns of protein coding genes. Since we are interested in the binding of our RBP on protein coding genes, we simply remove all binding sites on spurious annotations[^7].

[^7] Note: This is a simplified scheme and might be adapted for more complex research questions.

# show which gene types cause the most annoation overlaps. Split binding sites
# by the number of overlaps and remove those binding sites that do not overlap
# with any annotation (intergenic)
library(forcats)
splitSites = split(bindingSites, bindingSites$geneOl)
df = as(lapply(splitSites, function(x){subsetByOverlaps(gns,x)}), "GRangesList")
df = df[2:length(df)]
df = lapply(1:length(df), function(x) {
    data.frame(olClass = names(df[x]),
               geneType = factor(df[[x]]$gene_type))})
df = do.call(rbind,df)
df = df %>% mutate(geneType = fct_lump_n(geneType, 4)) 
# 
ggplot(df, aes(x = olClass, fill = geneType)) +
    geom_bar(position = "fill") + 
    scale_fill_brewer(palette = "Set2") +
    scale_y_continuous(labels = scales::percent) +
    labs(
        title = "Overlapping gene annotation causes",
        caption = "Percentage bar plot that show the fraction
    for each gene type and overlap number.",
    fill = "Gene type",
    x = "Number of overlaps", 
    y = "Percent"
    ) + theme_bw()

# Remove all binding sites that overlap multiple gene annotations and retain
# only protein coding genes and binding sites.
bindingSites = subset(bindingSites, geneOl == 1)
targetGenes = subsetByOverlaps(gns, bindingSites)
targetGenes = subset(targetGenes, gene_type == "protein_coding")

7.2 Transcript level profile

Now that we know our RBP targets mainly protein coding genes, we will have a closer look at positioning within transcript regions of protein coding genes. Again we use a TxDB compiled from gencode annotations to extract the relevant genomic regions777 Note that we prepared this file beforehand and only load the final object (see /inst/scripts/PrepareExampleData.R).

# Count the overlaps of each binding site fore each part of the transcript. 
cdseq = cds(annoDb) 
intrns = unlist(intronsByTranscript(annoDb)) 
utrs3 = unlist(threeUTRsByTranscript(annoDb)) 
utrs5 = unlist(fiveUTRsByTranscript(annoDb)) 
regions = list(CDS = cdseq, Intron = intrns, UTR3 = utrs3, UTR5 = utrs5)

# Load list with transcript regions
regionFile <- list.files(files, pattern = "regions.rds$", full.names = TRUE)
load(regionFile)

As one would expect a lot of multiple different transcript type annotations overlap with single binding sites. These numbers usually rise with more complex annotation sources used. Here most binding sites (3.716) overlap with a single distinct transcript region already, but a substantially amount does not.

# Count the overlaps of each binding site fore each part of the transcript. 
cdseq = regions$CDS %>% countOverlaps(bindingSites,.)
intrns = regions$Intron %>% countOverlaps(bindingSites,.)
utrs3 = regions$UTR3 %>% countOverlaps(bindingSites,.)
utrs5 = regions$UTR5 %>% countOverlaps(bindingSites,.)
countDf = data.frame(CDS = cdseq, Intron = intrns, UTR3 = utrs3, UTR5 = utrs5)
# Counting how many times an annotation is not present.
df = data.frame(olClass = apply(countDf,1,function(x) length(x[x == 0]))) 
df = data.frame(type = rev(names(table(df))), count = as.vector(table(df))) 
ggplot(df, aes(x = type, y = count)) +
    geom_col() +
    scale_y_log10() +
    geom_text(aes(label = count), vjust=-0.3) +
    labs(
        title = "Overlapping transcript annotation crashes",
        caption = "Bar plot that show how many times a binding site 
        overlaps with an annotation of a different transcript part",
        x = "Number of overlaps", 
        y = "Count (log10)"
    ) + theme_bw()

Before assigning binding sites to distinct transcript regions, it is worth to look at the unresolved profile888 Note: Binding sites are essentially counted multiple times here.. Here we observe a predominance for binding sites within intronic regions.

# Count the number of binding sites within each transcript region, ignoring the 
# problem of overlapping annotations (counting binding sites multiple times).
plotCountDf = countDf
plotCountDf[plotCountDf > 1] = 1
df = plotCountDf %>% 
    pivot_longer(everything()) %>% 
    group_by(name) %>% 
    summarize(count = sum(value), .groups = "drop") %>%
    mutate(name = as.factor(name)) %>%
    mutate(name = fct_relevel(name, "Intron", "UTR3", "CDS", "UTR5"))

ggplot(df, aes(x = name, y = count)) + 
    geom_col() +
    scale_y_log10() +
    xlab("Number of overlaps") +
    ylab("Count") +
    geom_text(aes(label = count), vjust=-0.3) +
    labs(
        title = "Binding sites per genomic region",
        caption = "Bar plot that shows the number of
    binding sites per genomic region.",
    x = "Genomic region", 
    y = "Count (log10)"
    ) + theme_bw()

The complexity of the problem can also be showcased using an UpSet plot representation. From this plot it becomes very clear that most binding sites reside within introns and that the most annotation conflicts are between introns and 3’UTRs.

# Show how many annotation overlaps are caused by what transcript types.
m = make_comb_mat(plotCountDf)
UpSet(m)

Resolving these types of conflicts is not straight forward and always involves setting up rules that define when to choose which annotation over the other. In our case we went for a majority vote based system, while resolving ties in the vote based on a hierarchical rule prefering introns over 3’UTRs over CDS over 5’UTRs999 Note: In rear cases additional intergenic binding sites can be found even though we removed these in a previous step. This is because some binding sites overlap with the annotation of a gene, but not with any of the transcript isoforms associated with the gene..

# Solve the overlapping transcript types problem with majority votes and 
# hierarchical rule. 
rule = c("Intron", "UTR3", "CDS", "UTR5")
# Prepare dataframe for counting (reorder by rule, add intergenic counts)
countDf = countDf[, rule] %>% 
    as.matrix() %>%
    cbind.data.frame(., Intergenic = ifelse(rowSums(countDf) == 0, 1, 0) ) 
# Apply majority vote scheme
regionName = colnames(countDf)
region = apply(countDf, 1, function(x){regionName[which.max(x)]})
mcols(bindingSites)$region = region
df = data.frame(Region = region)
# 
ggplot(df, aes(x = fct_infreq(Region))) +
    geom_bar() +
    scale_y_log10() +
    geom_text(stat='count', aes(label=..count..), vjust=-0.3) +
    labs(
        title = "Binding sites per genomic region",
        caption = "Bar plot that shows the number of binding sites per
    genomic region.",
    x = "Genomic region", 
    y = "Count (log10)"
    ) + theme_bw()

For a splicing regulator protein such as U2AF65 binding to introns is expected. However, due to the sheer size difference in introns one might also expect to observe more binding in introns just by chance. For that reason one could divide the number of binding sites by the summed up region length to compute a relative enrichment score101010 Note: Such a relative enrichment heavily favours smaller regions such as 5’UTRs. Therefore absolute numbers and relative enrichment should both be considered..

# Sum up the length of each transcript region with overlapping binding sites.
cdsLen = regions$CDS %>% 
    subsetByOverlaps(., bindingSites) %>% 
    width() %>% 
    sum()
intrnsLen = regions$Intron %>% 
    subsetByOverlaps(., bindingSites) %>%
    width() %>% 
    sum()
utrs3Len = regions$UTR3 %>% 
    subsetByOverlaps(., bindingSites) %>% 
    width() %>% 
    sum()
utrs5Len = regions$UTR5 %>% 
    subsetByOverlaps(., bindingSites) %>% 
    width() %>%
    sum()
lenSum = c(cdsLen, intrnsLen, utrs3Len, utrs5Len)

# Compute the relative enrichment per transcript region.
bindingSites = subset(bindingSites, region != "Intergenic")
df = data.frame(region = names(table(bindingSites$region)),
                count = as.vector(table(bindingSites$region)))
df$regionLength = lenSum
df$normalizedCount = df$count / df$regionLength
# 
ggplot(df, aes(x = region, y = normalizedCount)) +
    geom_col(width = 0.5) +
    labs(
        title = "Relative enrichment per genomic region ",
        caption = "Bar plot of region count, divided by the summed length.",
        x = "Genomic region", 
        y = "Relative enrichment",
        fill = "Genomic region"
    ) + theme_bw()

8 Session info

sessionInfo()

## R version 4.1.1 (2021-08-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] grid      stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] dplyr_1.0.7                 forcats_0.5.1              
##  [3] BindingSiteFinder_1.0.0     ComplexHeatmap_2.10.0      
##  [5] tidyr_1.1.4                 ggplot2_3.3.5              
##  [7] rtracklayer_1.54.0          GenomicAlignments_1.30.0   
##  [9] Rsamtools_2.10.0            Biostrings_2.62.0          
## [11] XVector_0.34.0              SummarizedExperiment_1.24.0
## [13] Biobase_2.54.0              MatrixGenerics_1.6.0       
## [15] matrixStats_0.61.0          GenomicRanges_1.46.0       
## [17] GenomeInfoDb_1.30.0         IRanges_2.28.0             
## [19] S4Vectors_0.32.0            BiocGenerics_0.40.0        
## [21] BiocStyle_2.22.0           
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-7           doParallel_1.0.16      RColorBrewer_1.1-2    
##  [4] tools_4.1.1            bslib_0.3.1            utf8_1.2.2            
##  [7] R6_2.5.1               DBI_1.1.1              colorspace_2.0-2      
## [10] GetoptLong_1.0.5       withr_2.4.2            tidyselect_1.1.1      
## [13] compiler_4.1.1         DelayedArray_0.20.0    labeling_0.4.2        
## [16] bookdown_0.24          sass_0.4.0             scales_1.1.1          
## [19] stringr_1.4.0          digest_0.6.28          rmarkdown_2.11        
## [22] pkgconfig_2.0.3        htmltools_0.5.2        highr_0.9             
## [25] fastmap_1.1.0          rlang_0.4.12           GlobalOptions_0.1.2   
## [28] shape_1.4.6            jquerylib_0.1.4        BiocIO_1.4.0          
## [31] generics_0.1.1         farver_2.1.0           jsonlite_1.7.2        
## [34] BiocParallel_1.28.0    RCurl_1.98-1.5         magrittr_2.0.1        
## [37] GenomeInfoDbData_1.2.7 Matrix_1.3-4           Rcpp_1.0.7            
## [40] munsell_0.5.0          fansi_0.5.0            lifecycle_1.0.1       
## [43] stringi_1.7.5          yaml_2.2.1             MASS_7.3-54           
## [46] zlibbioc_1.40.0        parallel_4.1.1         crayon_1.4.1          
## [49] lattice_0.20-45        circlize_0.4.13        magick_2.7.3          
## [52] knitr_1.36             pillar_1.6.4           rjson_0.2.20          
## [55] codetools_0.2-18       XML_3.99-0.8           glue_1.4.2            
## [58] evaluate_0.14          BiocManager_1.30.16    png_0.1-7             
## [61] vctrs_0.3.8            tweenr_1.0.2           foreach_1.5.1         
## [64] gtable_0.3.0           purrr_0.3.4            polyclip_1.10-0       
## [67] clue_0.3-60            assertthat_0.2.1       xfun_0.27             
## [70] ggforce_0.3.3          restfulr_0.0.13        tibble_3.1.5          
## [73] iterators_1.0.13       cluster_2.1.2          ellipsis_0.3.2

Bibliography

Busch, Anke, Mirko Brüggemann, Stefanie Ebersberger, and Kathi Zarnack. 2020. “ICLIP Data Analysis: A Complete Pipeline from Sequencing Reads to Rbp Binding Sites.” Methods 178 (June): 49–62. https://doi.org/10.1016/j.ymeth.2019.11.008.

Krakau, Sabrina, Hugues Richard, and Annalisa Marsico. 2017. “PureCLIP: Capturing Target-Specific ProteinRNA Interaction Footprints from Single-Nucleotide Clip-Seq Data.” Genome Biology 18 (1). https://doi.org/10.1186/s13059-017-1364-2.

Definition of binding sites from iCLIP signal

26 October 2021

Abstract

Package