DuplexDiscovereR tutorial

Egor Semenchenko1*

1Max Delbrück Center for molecular medicine; Berlin Institute for Medical Systems Biology; Freie Universität Berlin

29 October 2024

Abstract

DuplexDiscovereR implements methods for analysing data from RNA cross-linking and proximity ligation protocols such as SPLASH, PARIS, LIGR-seq and other methods which yeild information on RNA-RNA interactions as chimeric read fragments after high-throughput sequencing.

Package

DuplexDiscovereR 1.1.0

1 Installation

DuplexDiscovereR can be installed from Bioconductor:

if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")

BiocManager::install("DuplexDiscovereR")

library(DuplexDiscovereR)
?DuplexDiscovereR

or from Github page:

if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")

library(devtools)
devtools::install_github("Egors01/DuplexDiscovereR")

1.1 Installing RNAduplex

For calculating hybridization energies DuplexDiscovereR uses RNAduplex program from the ViennaRNA software suite.

Although this step is optional and is not required for using most of the package methods, we strongly recommend installing ViennaRNA from its web-page, as predicting RNA hybrids is one of the central aims of the analysis of RNA-RNA interaction data. Please note that ViennaRNA is distributed under its own licence.

2 Introduction

RNA cross-linking and proximity ligation methods as SPLASH, PARIS, LIGR-seq, RIC-seq and others provide infromation on the RNA-RNA interactions on a transcriptome-wide scale. These experiments generate high-throughput sequencing data containing a fraction of chimeric reads, where each chimeric part corresponds to the base-paired stretches of RNA (RNA duplexes)

DuplexDiscovereR is designed for the bioinformatic analysis of RNA data. It employs a workflow that allows users to identify RNA duplexes and and provides several options for filtering and ranking the results.

Starting from the set of aligned chimeric reads, it implements Classification and filtering of non-RNA duplex alignments and their into duplex groups (DGs). Identified DGs are then annotated with transcripts or other genomic features. P-values are calculated to test the hypothesis that DGs are generated by random ligation. Finally, RNA duplex base pairing and hybridization energies are predicted.

The optimal procedures for identifying RNA duplexes may vary between methods, as RNA-RNA interaction probing protocols are known to have biases for certain types of RNA and differ in experimental steps. DuplexDiscovereR allows user to build customized analysis by utilizing its methods for efficient chimeric read clustering, classification and convenience functions for flexible data filtering, annotation and visualization.

RNA duplex data is known to be sparse, with little overlap between the results of between the results of different probing protocols. To facilitate comparisons between different experiments and to allow straightforward cross-checking between replicates, DuplexDiscovereR includes functionality to intersect multiple RNA interaction datasets.

DuplexDiscovereR relies on the GInteractions container from InteractionSet package for the storing of the RNA-RNA interaction data and uses tidyverse tibble and dplyr for internal the Gviz - based DuplexTrack track and supports the exportof the results to the .sam file.

3 Quick start

3.1 Analysis steps

Key steps of RNA duplex data analysis with DuplexDiscovereR can be run through the single function call. It is expected that user already aligned the raw sequencing reads which produced some chimeric or split-read alignments. The input can be provided in several formats:

Tabular file with the split-read alignments
- Chimeric.out.junction file generated by STAR
- .bedpe format file (defined in bedtools)
GInteractions object with chimeric alignments.

Full DuplexDiscovereR workflow consists of the following steps:

Pre-processing of the input. This step calls the classification of the reads into following classes:
- 2arm reads (only one chimeric junction per aligned read)
- multi-split (two or more one chimeric junctions per aligned read)
- multi-mapped reads
Comparing the 2arm reads mapped reads with the splice junction annotation. This filters out alignments across the exon-exon junctions which typically contaminate the ‘true’ chimeric alignments originated from the RNA duplexes
Deduplication of the identical read alignments
Clustering of reads into duplex groups (DGs)

Optionally,

Annotation of the DGs with transcript or other features
Prediction of the base-pairing hybridization energies
Computing p-values from testing the probability of random ligation

3.2 Input arguments

Required inputs for minimal run are the following:

data Chimeric read alignments or coordinates. Object of any table type, convertible to tibble or GInteractions
junctions_gr GRanges object with splice junction coordinates.
Library type and type of the read table, if input is not GInteractions.

Optional inputs for executing the full workflow:

anno_gr a GRanges object with genes or transcripts annotation object
df_counts a 2-column table with the id in the 1st column and raw read count in the 2nd. id should correspond to the entries in anno_gr. Providing this argument triggers p-value calculation.
fafile path to .fasta file with the genome. Providing this argument triggers prediction of RNA hybrids.

3.3 Loading example data

Once installed, we can load the example data. Example dataset is based on the RNA duplex probing of SPLASH ES-cells Aw. et.al 2016 which was aligned with STAR Dobin et.al, 2013 and subset to the chromosome 22.

library(DuplexDiscovereR)
data(RNADuplexesSampleData)

Details on alignment of the reads of example dataset and fetching of the annotation data are described in the document which can be found at

system.file("extdata", "scripts", "DD_data_generation.R", package = "DuplexDiscovereR")
## [1] ""

3.4 Running analysis

genome_fasta <- NULL
result <- DuplexDiscovereR::runDuplexDiscoverer(
    data = RNADuplexesRawChimSTAR,
    junctions_gr = SampleSpliceJncGR,
    anno_gr = SampleGeneAnnoGR,
    df_counts = RNADuplexesGeneCounts,
    sample_name = "example_run",
    lib_type = "SE",
    table_type = "STAR",
    fafile = genome_fasta,
)

The result is a DuplexDiscovererResults object which contains several outputs of different dimensions.

result
## DuplexDiscovereR Results
## Sample name: example_run
## Chimeric reads statistics:
##          feature   count
##    n_reads_input 5000.00
##            n_dgs   70.00
##             2arm  730.00
##       2arm_short   85.00
##          2arm_sj 1268.00
##        multi_map  218.00
##      multi_split  285.00
##  multi_split&map  234.00
##         not_chim 2173.00
##         self_ovl    7.00
##       mean_len_A   58.83
##     median_len_A   59.00
##       mean_len_B   59.19
##     median_len_B   59.00
## Use class accessors to retrieve output. I.e duplex_groups(resultsObject)

The primary output are detected duplex groups. This is a GInteractions, where each entry represents boundaries of the detected RNA duplex. Metadata fields as n_reads , p_val , score and others can be used for ranking and sub-setting results for downstream analysis.

gi_clusters <- dd_get_duplex_groups(result)
head(gi_clusters, 2)
## StrictGInteractions object with 2 interactions and 17 metadata columns:
##       seqnames1           ranges1     seqnames2           ranges2 |   n_reads
##           <Rle>         <IRanges>         <Rle>         <IRanges> | <numeric>
##   [1]     chr22 39318489-39318567 ---     chr22 39319587-39319625 |        16
##   [2]     chr22 24555998-24556058 ---     chr22 24557657-24557731 |         8
##           dg_id     score          gene_id.A          gene_id.B gene_name.A
##       <integer> <numeric>        <character>        <character> <character>
##   [1]         1   85.8981 ENSG00000100316.16 ENSG00000100316.16        RPL3
##   [2]         2   86.3571 ENSG00000100028.12 ENSG00000100028.12      SNRPD3
##       gene_name.B    gene_type.A    gene_type.B   ambig.A   ambig.B
##       <character>    <character>    <character> <integer> <integer>
##   [1]        RPL3 protein_coding protein_coding         0         0
##   [2]      SNRPD3 protein_coding protein_coding         1         1
##                 ambig_list.A           ambig_list.B       cis       p_val
##                  <character>            <character> <numeric>   <numeric>
##   [1]                   <NA>                   <NA>         1 0.00000e+00
##   [2] SNRPD3,ENSG00000286070 SNRPD3,ENSG00000286070         1 3.43341e-18
##       gene_count.A gene_count.B
##          <numeric>    <numeric>
##   [1]       795095       795095
##   [2]         3983         3983
##   -------
##   regions: 140 ranges and 1 metadata column
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Individual read-level data can be accessed by running:

gi_reads <- dd_get_chimeric_reads(result)
head(gi_reads, 2)
## StrictGInteractions object with 2 interactions and 30 metadata columns:
##       seqnames1           ranges1     seqnames2           ranges2 |
##           <Rle>         <IRanges>         <Rle>         <IRanges> |
##   [1]     chr22 37877934-37877970 ---     chr22 37878003-37878043 |
##   [2]     chr22 39314697-39314721 ---     chr22 39314743-39314788 |
##                  readname    map_type     score brkpt_donorA brkpt_acceptorB
##               <character> <character> <numeric>    <numeric>       <numeric>
##   [1] SRR3404943.86173415        2arm        74     37878043        37877933
##   [2] SRR3404943.37708269        2arm        67     39314696        39314788
##       junction_type repeat_left_lenA repeat_right_lenB  cigar_alnA  cigar_alnB
##            <factor>        <numeric>         <numeric> <character> <character>
##   [1]          2arm                0                 1    1S40M36S      41S36M
##   [2]          2arm                0                 0      45S24M      45M24S
##       num_chim_aln max_poss_aln_score non_chim_aln_score bestall_chim_aln_score
##          <numeric>          <numeric>          <numeric>              <numeric>
##   [1]            1                 77                 40                     74
##   [2]            1                 69                 46                     67
##       PEmerged_bool   readgrp    ngapsA    ngapsB  lengapsA  lengapsB   n_reads
##           <numeric> <logical> <integer> <integer> <numeric> <numeric> <numeric>
##   [1]             0      <NA>         0         0         0         0         1
##   [2]             0      <NA>         0         0         0         0         1
##         read_id splicejnc splicejnc_donor splicejnc_acceptor      keep
##       <integer> <integer>       <numeric>          <numeric> <logical>
##   [1]        13         0               0                  0      TRUE
##   [2]        14         0               0                  0      TRUE
##           dg_id duplex_id n_reads_dg was_clustered
##       <integer> <integer>  <numeric>     <numeric>
##   [1]      <NA>         1          0             1
##   [2]         7         2          8             1
##   -------
##   regions: 3721 ranges and 0 metadata columns
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Each entry in gi_reads is a single read. They are tagged by DG with the dg_id, but not collapsed into groups. This layout is used for visualization and may be useful for a more detailed review of the structure of the individual DGs.

The dimensions of gi_reads do not correspond 1:1 to the size of the to the size of the input, since in general not every element in the input data can be represented as a 2-arm split alignment.

To track each input alignment one can query the the read-level classification results, which have the same dimension with the input.

df_reads <- dd_get_reads_classes(result)
print(dim(df_reads))
## [1] 5000   11
print(dim(RNADuplexesRawChimSTAR))
## [1] 5000   21

df_reads has fields such as read_type, map_type and junction_type indicating the results of chimeric read classification and filtering.

This layout is useful for assessing the statistics and highlighting potential problems and optimization points, i.e. how many reads from the input were classified as true chimeric and how many of them were assigned to duplex groups.

table(df_reads$read_type)
## 
##            2arm      2arm_short         2arm_sj       multi_map     multi_split 
##             730              85            1268             218             285 
## multi_split&map        not_chim        self_ovl 
##             234            2173               7

Among the 2-arm alignments from gi_reads, only unambigously aligned reads without self-ovelaps that pass splice-junctions and minimum junction length filters are subjected to clustering into DGs. To see the determined types of 2-arm alignments, we can check the junction_type metadata field

table(gi_reads$junction_type)
## 
##          2arm    2arm_short      self_ovl antisense_ovl 
##          1998            85             7             0

3.5 Writing output

The resulting GInteractions with duplex groups are ready for integration into transcriptomic analysis using other Bioconductor packages.

There are several options available for outputting the results for other uses.

3.5.1 Output to table

One can convert the object with duplex groups to dataframe-like object and write it to file

clusters_dt <- makeDfFromGi(gi_clusters)
write.table(clusters_dt, file = tempfile(pattern = "dgs_out", fileext = ".tab"))

3.5.2 Output to .sam file

Saving results to .sam file can be a useful tool for downstream visualization with interactive viewers as IGV

writeGiToSAMfile(gi_clusters, file_out = tempfile(pattern = "dgs_out", fileext = ".sam"), genome = "hg38")

3.6 Visualization

Collapsed DGs can be visualized with Gviz -based DuplexTrack, allowing overlays of RNA duplexes with the genes of other features

library(Gviz)
# define plotting region
plotrange <- GRanges(
    seqnames = "chr22",
    ranges = IRanges(
        start = c(37877619),
        end = c(37878053)
    ),
    strand = "+"
)
# make genes track
anno_track <- Gviz::GeneRegionTrack(SampleGeneAnnoGR,
    name = "chr22 Genes",
    range = plotrange
)
# construct DuplexTrack
duplex_track <- DuplexTrack(
    gi = gi_clusters,
    gr_region = plotrange,
    labels.fontsize = 12,
    arcConstrain = 4,
    annotation.column1 = "gene_name.A",
    annotation.column2 = "gene_name.B"
)
plotTracks(list(anno_track, duplex_track),
    sizes = c(1, 3),
    from = start(plotrange) - 100,
    to = end(plotrange) + 100
)

3.7 Calcualting hybridization energies

In the example above we have omitted the base-pairing prediction and the hybridization energies by not passing the path to file with genome as the fafile argument. However, the prediction of RNA hybrids is central to the analysis of RNA-RNA interaction data.

To compute the structure formed by two stretches of RNA DuplexDiscoverer uses RNAduplex algorithm from ViennaRNA. Note that ViennaRNA is distributed under its own licence. Please refer to ViennaRNA’s license for details.

If the RNAduplex is installed, all steps of the analysis can be re-run by calling runDuplexDiscoverer with the option fafile = <path to genome fasta>. Alternatively, one can predict RNA hybrids by calling separate function on already existing GInteractions object:

sequence <- paste0(
"AGCUAGCGAUAGCUAGCAUCGUAGCAUCGAUCGUAAGCUAGCUAGCUAGCAUCGAUCGUAGCUAGCAUCGAU",
"CGUAGCAUCGUAGCUAGCUAGCUAUGCGAUU")

# Save the sequence to a temp fasta file
fasta_file = tempfile(fileext = '.fa')
chrom <- "test_chrA"
writeLines(c(">test_chrA", sequence), con = fasta_file)

# Create the GInteraction object
# Define start and end positions for the base-pairing regions
regions <- data.frame(
    start1 = c(1, 11, 21, 31, 41),
    end1 = c(10, 20, 30, 40, 50),
    start2 = c(91, 81, 71, 61, 51),
    end2 = c(100, 90, 80, 70, 60)
)
# GRanges objects for the anchors
anchor1 <- GRanges(seqnames = chrom, ranges = IRanges(start = regions$start1, end = regions$end1))
anchor2 <- GRanges(seqnames = chrom, ranges = IRanges(start = regions$start2, end = regions$end2))
interaction <- GInteractions(anchor1, anchor2)
# predict hybrids
getRNAHybrids(interaction,fasta_file)

3.8 Comparing multiple samples or replicates

Due to the sparse nature of RNA duplex data, it is desirable to be able to compare replicates of the same experiment or to compare different data sets.

DuplexDiscovereR provides a method for finding intersections between multiple samples. Comparisons between ranges can result in one-to-many relations, i.e. when a region (or pair a of regions) of one sample contains multiple regions of other samples. DuplexDiscovereR addresses this problem by using the following procedure: first it computes a non-redundant set of interactions that stacks and collapses interactions from all input samples. This resulting super-set is then compared in a pairwise manner with each entry from each individual sample.

To demonstrate how multiple samples can be compared, we will generate three pseudo-replicates from the example data which we clustered above.

First, we will select the clustered reads and divide them into three groups

clust_reads <- gi_reads
clust_reads <- clust_reads[!is.na(clust_reads$dg_id)]

# Create separate pseudo-sample objects for each group
set.seed(123)
group_indices <- sample(rep(1:3, length.out = length(clust_reads)))

group1 <- clust_reads[group_indices == 1]
group2 <- clust_reads[group_indices == 2]
group3 <- clust_reads[group_indices == 3]

Since individual reads have already been assigned to duplex groups, we can collapse them, ensuring that we have some overlap between reads that belong to the same ‘true’ DGs, but are spread across three artificial sets.

group1 <- collapse_duplex_groups(group1)
group2 <- collapse_duplex_groups(group2)
group3 <- collapse_duplex_groups(group3)

a <- list("sample1" = group1, "sample2" = group2, "sample3" = group3)
res_comp <- compareMultipleInteractions(a)
names(res_comp)
## [1] "gi_all"   "dt_upset"

Result contains slot for interaction superset and overlap information information. We can visualize the intersections of the samples using the upset plot

# first, check if UpSetR is installed
if (!requireNamespace("UpSetR", quietly = TRUE)) {
    stop("Install 'UpSetR' to use this function.")
}
library(UpSetR)
upset(res_comp$dt_upset, text.scale = 1.5)

4 Building customized RNA duplex data analysis

One can customize the analysis by performing classification, clustering and subsequent annotation procedures step by step, adapting the procedures to the RNA-RNA interaction probing protocol or upstream/downstream analysis.

In this section, we show how users can use the core methods in the package to achieve more flexibility in this task.

4.1 Preprocessing

Most of the methods in DuplexDiscovereR work with the RNA-RNA interaction data stored in GInteraction object. There are several metadata fields which are not strictly required by the package, but utilized in procedures run classification: map_type, read_id and n_reads,score for clustering.

The easiest way to run DuplexDiscovereR step-by-step is to start with a pre-processing routine. It is fairly flexible in input type: it can be either GInteractions , Chimeric.out.Junction file from STAR or generic .bedpe file with pairs of regions.

data("RNADuplexesSampleData")

new_star <- runDuplexDiscoPreproc(
    data = RNADuplexesRawChimSTAR,
    table_type = "STAR",
    library_type = "SE",
)

new_bedpe <- runDuplexDiscoPreproc(
    data = RNADuplexesRawBed,
    table_type = "bedpe",
    library_type = "SE", return_gi = TRUE
)

By default, to make downstream read filtering slightly more convenient, tibble is returned after the pre-processing. In case of the GInteractions input, input reads already are 2-arm split by design, so we can return GInteractions .

# keep only readname in metadata
mcols(RNADuplexSampleGI) <- mcols(RNADuplexSampleGI)["readname"]
new_gi <- runDuplexDiscoPreproc(data = RNADuplexSampleGI, return_gi = TRUE)
head(new_gi, 1)
## StrictGInteractions object with 1 interaction and 5 metadata columns:
##       seqnames1           ranges1     seqnames2           ranges2 |
##           <Rle>         <IRanges>         <Rle>         <IRanges> |
##   [1]     chr22 38303215-38303249 ---     chr22 38336125-38336170 |
##                   readname    map_type     score   n_reads   read_id
##                <character> <character> <numeric> <numeric> <integer>
##   [1] SRR3404943.141026859        2arm         1         1         1
##   -------
##   regions: 3721 ranges and 0 metadata columns
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

4.2 Classification and filtering

The basic classification routines implemented in the package are comparing the chimeric junctions to the known splice junctions and determining the types of the junctions. Both can be called by a separate method that adds a corresponding metadata field to the input:

gi <- getChimericJunctionTypes(new_gi)
## 
## --- filtering 2-arm duplexes by junction type  ---
## Duplexes categorized by chimeric junction span: 2090
## Duplexes with normal junctions: 1941: 92.87 %
## Duplexes with self-overlap: 7: 0.33 %
## Duplexes with antisense self-overlap: 0: 0 %
## Duplexes with the junction shorter than 10nt): 142: 6.79 %
gi <- getSpliceJunctionChimeras(gi, sj_gr = SampleSpliceJncGR)
## 
## --- searching for the exon-exon junctions  ---
## Chimeras to test against splice junctions: 2090
## SJ entries  : 1325: 63.397 % of all
## SJ entries / single chr entries: 1325/2063 : 64.23 % of single chromosome chimeras

One can inspect detected junction types

table(gi$junction_type)
## 
##          2arm    2arm_short      self_ovl antisense_ovl 
##          1941           142             7             0

and amount of reads which are not likely to come from true RNA duplexes, because their chimeric junctions are too similar to normal exon-exon junction

table(gi$splicejnc)
## 
##    0    1 
##  765 1325

For determining the duplex groups, we will leave only chimeric junctions in a read which do not self-overlap:

keep <- which((gi$junction_type == "2arm") & (gi$splicejnc == 0))
gi <- gi[keep]

4.3 Clustering reads into duplex groups

The most basic way to find the duplex groups is to simply call

gi <- clusterDuplexGroups(gi)

This procedure will:

Find overlaps between all chimeric read pairs.
Represent reads as a directed graph, where edges are weighted with the amount of overlap between the reads.
Call a community search algorithm on the graph which finds the read clusters.
Add the read cluster IDs as the dg_id field to the input object. NA values are used for reads which are not a part of any read duplex group (DG)

table(is.na(gi$dg_id))
## 
## FALSE  TRUE 
##   166   450

To collapse reads to the DGs with the boundaries, containing every read of DG, one can call

gi_dgs <- collapse_duplex_groups(gi)
head(gi_dgs, 1)
## StrictGInteractions object with 1 interaction and 3 metadata columns:
##       seqnames1           ranges1     seqnames2           ranges2 |   n_reads
##           <Rle>         <IRanges>         <Rle>         <IRanges> | <numeric>
##   [1]     chr22 39314686-39314744 ---     chr22 39314738-39314827 |         7
##           dg_id     score
##       <integer> <numeric>
##   [1]         1         1
##   -------
##   regions: 92 ranges and 0 metadata columns
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
# number of DGs
length(gi_dgs)
## [1] 46

The above procedure can be applied to any subset of data. If the user is interested in the RNA structure in a particular region, they can select all reads overlapping with this region and repeat the clustering several times. By adjusting the parameters (minimum overlap, maximum allowed distance between alignment arms, coordinates of coordinates) one can obtain more refined or larger duplex groups for improving subsequent prediction of RNA secondary structure.

4.4 Customization of the read clustering

4.4.1 Collapsing identical reads

For large datasets, it is possible to identify and collapse identical reads – those that map to the same coordinates with the same score. In overlap-based clustering, these reads would belong to the same duplex groups. Collapsing them into a single entry reduces the number of nodes and vertices in the clustering graph, resulting in better time and memory performance.

res_collapse <- collapseIdenticalReads(gi)
## Duplicated   :  1.95% of initial
## Initial size :  616
## New size     :  604
# returns new object
gi_unique <- res_collapse$gi_collapsed

This procedure adds duplex_id column as new id and removes read_id, because single entry in a new object can correspond to multiple reads. n_reads records the number of reads in this temporary duplex group.

head(res_collapse$stats_df, 1)
## # A tibble: 1 × 3
##   read_id duplex_id n_reads_collapsed
##     <int>     <int>             <dbl>
## 1       3         1                 1

Note that collapse_duplex_groups() method has options to handle temporary duplex groups, so that merging identical reads before clustering would not lose information on the read support for duplex group.

# cluster  gi_unqiue
gi_unique_dg <- collapse_duplex_groups(clusterDuplexGroups(gi_unique))
# check if the get the same amount of reads as in basic clustering
sum(gi_unique_dg$n_reads) == sum(gi_dgs$n_reads)
## [1] TRUE

4.4.2 Custom weights for clustering

There is a possibility for a user to re-define the rules by which overlaps between reads are weighted before community search.

This might be a customization related to protocols or the size of the dataset. Possible scenarios are weighting the clustering graph vertices with hybridization energies calculated directly on the reads instead of duplex groups, or employing other heuristic assumptions on how RNA duplexes are translated into chimeric read fragments. e.g in RIL-seq, the second chimeric arm is more likely to be a target RNA rather than sRNA.

To demonstrate such a modification, we will first compute the overlap graph manually, using the following procedure, second, modify the graph weights, and third, call the clustering.

First, we find pair overlaps in the input GInteractions object

graph_df <- computeGISelfOverlaps(gi_unique)
head(graph_df)
## # A tibble: 6 × 9
##   duplex_id.1 duplex_id.2 A_span B_span A_ovl B_ovl ratio.A ratio.B weight
##         <int>       <int>  <int>  <int> <int> <int>   <dbl>   <dbl>  <dbl>
## 1           2         173     34     51    23    28   0.676   0.549  1.23 
## 2           2         402     35     70    20    19   0.571   0.271  0.843
## 3           2         437     38     51    25    43   0.658   0.843  1.50 
## 4           2         481     32     55    18    46   0.562   0.836  1.40 
## 5           2         501     48     79    19    27   0.396   0.342  0.738
## 6          12         310     69     44    42    26   0.609   0.591  1.20

The size of this dataframe is determined by the presence of pairwise overlap between both arms of read alignments. The first two columns represent the element index or duplex_id of the corresponding reads. It can be filtered or supplemented with other relevant metrics of the pairwise relationship between the reads.

Then we change the rule for how we handle overlap between reads. In this example we will manually change the span-to-overlap ratio per pair or overlapping alignment arms. Read pairs that overlap for less than half of their length will not be treated as reliable to be considered part of the same DG.

graph_df <- graph_df[graph_df$ratio.A > 0.5 & graph_df$ratio.B > 0.5, ]

We will use the sum of the overlap ratios as the new weights and re-scale them to {0,1}.

rescale_vec <- function(x) {
    return((x - min(x)) / (max(x) - min(x)))
}

graph_df$weight_new <- graph_df$ratio.A + graph_df$ratio.B
graph_df$weight_new <- rescale_vec(graph_df$weight_new)
head(graph_df, 2)
## # A tibble: 2 × 10
##   duplex_id.1 duplex_id.2 A_span B_span A_ovl B_ovl ratio.A ratio.B weight
##         <int>       <int>  <int>  <int> <int> <int>   <dbl>   <dbl>  <dbl>
## 1           2         173     34     51    23    28   0.676   0.549   1.23
## 2           2         437     38     51    25    43   0.658   0.843   1.50
## # ℹ 1 more variable: weight_new <dbl>

This dataframe then can be passed to the clustering function:

gi_clust_adj <- clusterDuplexGroups(gi_unique, graphdf = graph_df, weight_column = "weight_new")
gi_clust_adj_dgs <- collapse_duplex_groups(gi_clust_adj)

We get smaller number of the DGs, as some of them are not assembled after restricting the minimum required overlap

length(gi_clust_adj)
## [1] 604

5 Session information

sessionInfo()
## R Under development (unstable) (2024-10-21 r87258)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] grid      stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] UpSetR_1.4.0                Gviz_1.51.0                
##  [3] DuplexDiscovereR_1.1.0      InteractionSet_1.35.0      
##  [5] SummarizedExperiment_1.37.0 Biobase_2.67.0             
##  [7] MatrixGenerics_1.19.0       matrixStats_1.4.1          
##  [9] GenomicRanges_1.59.0        GenomeInfoDb_1.43.0        
## [11] IRanges_2.41.0              S4Vectors_0.45.0           
## [13] BiocGenerics_0.53.0         BiocStyle_2.35.0           
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3       rstudioapi_0.17.1        jsonlite_1.8.9          
##   [4] magrittr_2.0.3           magick_2.8.5             GenomicFeatures_1.59.0  
##   [7] farver_2.1.2             rmarkdown_2.28           BiocIO_1.17.0           
##  [10] zlibbioc_1.53.0          vctrs_0.6.5              memoise_2.0.1           
##  [13] Rsamtools_2.23.0         RCurl_1.98-1.16          base64enc_0.1-3         
##  [16] tinytex_0.53             htmltools_0.5.8.1        S4Arrays_1.7.0          
##  [19] progress_1.2.3           curl_5.2.3               SparseArray_1.7.0       
##  [22] Formula_1.2-5            sass_0.4.9               bslib_0.8.0             
##  [25] htmlwidgets_1.6.4        plyr_1.8.9               httr2_1.0.5             
##  [28] cachem_1.1.0             GenomicAlignments_1.43.0 igraph_2.1.1            
##  [31] lifecycle_1.0.4          pkgconfig_2.0.3          Matrix_1.7-1            
##  [34] R6_2.5.1                 fastmap_1.2.0            GenomeInfoDbData_1.2.13 
##  [37] digest_0.6.37            colorspace_2.1-1         AnnotationDbi_1.69.0    
##  [40] Hmisc_5.2-0              RSQLite_2.3.7            labeling_0.4.3          
##  [43] filelock_1.0.3           fansi_1.0.6              httr_1.4.7              
##  [46] abind_1.4-8              compiler_4.5.0           withr_3.0.2             
##  [49] bit64_4.5.2              htmlTable_2.4.3          backports_1.5.0         
##  [52] BiocParallel_1.41.0      DBI_1.2.3                highr_0.11              
##  [55] biomaRt_2.63.0           rappdirs_0.3.3           DelayedArray_0.33.0     
##  [58] rjson_0.2.23             ggsci_3.2.0              tools_4.5.0             
##  [61] foreign_0.8-87           nnet_7.3-19              glue_1.8.0              
##  [64] restfulr_0.0.15          checkmate_2.3.2          cluster_2.1.6           
##  [67] generics_0.1.3           gtable_0.3.6             BSgenome_1.75.0         
##  [70] tidyr_1.3.1              ensembldb_2.31.0         data.table_1.16.2       
##  [73] hms_1.1.3                xml2_1.3.6               utf8_1.2.4              
##  [76] XVector_0.47.0           pillar_1.9.0             stringr_1.5.1           
##  [79] dplyr_1.1.4              BiocFileCache_2.15.0     lattice_0.22-6          
##  [82] rtracklayer_1.67.0       bit_4.5.0                deldir_2.0-4            
##  [85] biovizBase_1.55.0        tidyselect_1.2.1         Biostrings_2.75.0       
##  [88] knitr_1.48               gridExtra_2.3            bookdown_0.41           
##  [91] ProtGenerics_1.39.0      xfun_0.48                stringi_1.8.4           
##  [94] UCSC.utils_1.3.0         lazyeval_0.2.2           yaml_2.3.10             
##  [97] evaluate_1.0.1           codetools_0.2-20         interp_1.1-6            
## [100] tibble_3.2.1             BiocManager_1.30.25      cli_3.6.3               
## [103] rpart_4.1.23             munsell_0.5.1            jquerylib_0.1.4         
## [106] dichromat_2.0-0.1        Rcpp_1.0.13              dbplyr_2.5.0            
## [109] png_0.1-8                XML_3.99-0.17            parallel_4.5.0          
## [112] ggplot2_3.5.1            blob_1.2.4               prettyunits_1.2.0       
## [115] latticeExtra_0.6-30      jpeg_0.1-10              AnnotationFilter_1.31.0 
## [118] bitops_1.0-9             VariantAnnotation_1.53.0 scales_1.3.0            
## [121] purrr_1.0.2              crayon_1.5.3             rlang_1.1.4             
## [124] KEGGREST_1.47.0