1 Introduction
2 Installation
3 ClustIRR algorithm
4 Case study: analysis of three TCR repertoires

library(knitr)
library(ClustIRR)
library(igraph)
library(ggplot2)
library(ggrepel)
library(patchwork)

1 Introduction

Adaptive immunity relies on diverse immune receptor repertoires (IRRs: B- and T-cell receptor repertoires) to protect the host against genetically diverse and rapidly evolving pathogens, such as viruses, bacteria, and cancers. The sequence diversity of B- and T-cell receptors (BCRs and TCRs) originates, in part, from V(D)J recombination, a process in which different germline-encoded genes are joined to form unique immune receptors. As a result, practically every newly formed naive mature T and B cell is equipped with a distinct immune receptor (IR), enabling them to recognize unique sets of antigens.

B cells bind antigens directly through the complementarity-determining regions (CDRs) of their BCRs, while T cells recognize antigenic peptides presented by major histocompatibility complex (MHC) molecules via the CDRs of their TCRs. Antigen recognition can lead to B/T cell activation, causing these cells to rapidly proliferate and form antigen-specific clones capable of mounting an effective immune response.

Recent studies have shown that sequence similarity between TCRs often indicates shared antigen specificity. Therefore, by clustering TCR sequences from high-throughput sequencing (HT-seq) data, we can identify communities of TCRs with shared antigen specificity. By tracking the dynamics of these communities over time or across biological conditions, we may learn how our immune system responds to e.g. cancer immunotherapies, vaccines, and antiviral drugs, which can help us improve these treatments.

This vignette introduces ClustIRR, a computational method that detects communities of immune receptors with similar specificity and employs Bayesian models to quantify differential community occupancy between IRRs from different biological conditions (e.g. before and after cancer treatment).

2 Installation

ClustIRR is freely available as part of Bioconductor, filling the gap that currently exists in terms of software for quantitative analysis of IRRs.

To install ClustIRR please start R and enter:

if(!require("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install("ClustIRR")

3 ClustIRR algorithm

3.1 Input

The main input of ClustIRR is an IRR (s), which should be provided as data.frame. The rows in the data.frame correspond to clones (clone = group of cells derived from a common parent cell by clonal expansion). We use the following data from each clone:

Amino acid sequences of the complementarity determining regions 3 from one or both chains (e.g. CDR3$\alpha$ and CDR3$\beta$ from $\alpha\beta$ TCRs).
Clone size, which refers to the frequency of cells that belong to the clone.

In a typical scenario, the user will have more than one IRR (see workflow). For instance, the user will analyze longitudinal IRR data, i.e., two or three IRRs taken at different time points; or across different biological conditions.

Let’s have a look at an example IRR: two TCR$\alpha\beta$ repertoires $a$ and $b$.

data("CDR3ab", package = "ClustIRR")

set.seed(127)
n <- 300

# get 300 CDR3a and CDR3b pairs from the data -> IRR a
a <- data.frame(CDR3a = CDR3ab$CDR3a[1:n], 
                CDR3b = CDR3ab$CDR3b[1:n], 
                clone_size=rpois(n = n, lambda = 3)+1, 
                sample = "a")

# get 200 CDR3a and CDR3b pairs from the data -> IRR b
b <- data.frame(CDR3a = CDR3ab$CDR3a[101:(n+100)], 
                CDR3b = CDR3ab$CDR3b[101:(n+100)], 
                clone_size=rpois(n = n, lambda = 3)+1, 
                sample = "b")

str(a)

'data.frame':   300 obs. of  4 variables:
 $ CDR3a     : chr  "CASSLRGAHNEQFF" "CASGLRQGYGYTF" "CSAGGFRETQYF" "CGSSLSQGSEQYF" ...
 $ CDR3b     : chr  "CASTVTSGSNEKLFF" "CASSLTGTGYTF" "CSALTPGLIYNEQFF" "CSARASWGTNEKLFF" ...
 $ clone_size: num  3 2 3 2 3 5 4 3 5 5 ...
 $ sample    : chr  "a" "a" "a" "a" ...

str(b)

'data.frame':   300 obs. of  4 variables:
 $ CDR3a     : chr  "CASSRLEGSDTQYF" "CASTEYGNTIYF" "CAWSVLSDTQYF" "CSVERDRADGYTF" ...
 $ CDR3b     : chr  "CSARGRTSVNEQFF" "CSVAEGLGTEAFF" "CVSSSTGNNQPQHF" "CATSRPGGEWETQYF" ...
 $ clone_size: num  2 4 4 5 3 4 9 3 5 4 ...
 $ sample    : chr  "b" "b" "b" "b" ...

3.2 Step 1. IRR analysis with `cluster_irr`

3.2.1 Theory: how to compute similarities between IR clones?

IRRs, such as T-cell receptor repertoires, are made up of T-cells which are distributed over T-cell clones. TCR clones with identical pairs of CDR3$\alpha$ and CDR3$\beta$ sequences most likely recognize the same sets of antigens. Meanwhile, TCR clones with similar pairs of CDR3$\alpha$ and CDR3$\beta$ sequences may also share common specificity. ClustIRR aims to quantify the similarity between pairs of TCR clones based on the similarities of their CDR3s sequences.

How to compute a similarity score between a pair of CDR3 sequences?

Pair of sequences, $a$ and $b$, are aligned with the Needleman-Wunsch algorithm. The output is an alignment score ($\omega$). Identical or similar CDR3 sequence pairs get a large positive $\omega$, and dissimilar CDR3 sequence pairs get a low (or even negative) $\omega$.

To make sure that $\omega$ is comparable across pairs of CDR3s with different lengths, ClustIRR divides (normalizes) $\omega$ by the length of the longest CDR3 sequences in each pair: \[\begin{align} \bar{\omega} = \dfrac{\omega}{\max(|a|, |b|)} \end{align}\] where $|a|$ and $|b|$ are the lengths of CDR3 sequences $a$ and $b$; and $\bar{\omega}$ is the normalized alignment score.

The CDR3 cores, which represent the central parts of the CDR3 loop and tend to have high probability of making a contact with the antigen, are also compared. ClustIRR constructs the CDR3 cores by trimming few residues (defined by control$trim_flanks) from either end of each CDR3 sequences. These are then aligned and scored based on the same algorithm, yielding for each pair of CDR3 cores a normalized alignment scores $\bar{\omega}_c$.

This strategy is computationally very expensive!

For large IRRs with $n>10^6$ this algorithm requires significant computational resources. To mitigate this challenge, we employ a screening step in which dissimilar sequences pairs are flagged. In short, each CDR3 is used as a query in a fast protein-BLAST search as implemented in the R-package blaster, while the remaining CDR3s are considered as a database of amino acid sequences against which the query is compared. CDR3 sequences which share at least 70% sequence identity (user parameter control$gmi) with the query are selected, and only these are aligned with query CDR3. For the remaining CDR3 pairs we assume $\bar{\omega}=0$.

3.2.2 Example

Step 1. involves calling the function clust_irr which returns an S4 object of class clust_irr.

# perform clust_irr analysis of repertoire a
c_a <- cluster_irr(s = a, control = list(gmi = 0.7))

# ... and b
c_b <- cluster_irr(s = b, control = list(gmi = 0.7))

Next, we show the chain-specific similarity scores between CDR3s sequences. Each row is a pair of CDR3 sequences from the repertoire. For each pair we have the following metrics:

max_len: length of the longer CDR3 sequence in the pair
max_clen: length of the longer CDR3 core sequence in the pair
weight: $\omega$ = BLOSUM62 score of the complete CDR3 alignment
cweight= $\omega_c$: BLOSUM62 score of CDR3 cores
nweight = $\bar{\omega}$: normalized weight by max_len
ncweight = $\bar{\omega}_c$: normalized cweight by max_clen

The results for CDR3a:

kable(head(c_a@clust$CDR3a), digits = 2)

from_cdr3	to_cdr3	weight	cweight	max_len	max_clen	nweight	ncweight
CSATASGGSYNEQFF	CAWTLSGSSYNEQFF	60	36	15	9	4.00	4.00
CSVDALAGVSETQYF	CASSSETQYF	13	-16	15	9	0.87	-1.78
CSVEKNTEAFF	CSVETGKNTEAFF	41	8	13	7	3.15	1.14
CSVLTVSSYNEQFF	CASSAGTGSSYNEQFF	36	8	16	10	2.25	0.80
CASSVGEATNEKLFF	CASSLEADHEKLFF	43	10	15	9	2.87	1.11
CASSYARTTEAFF	CASSYISMNTEAFF	40	7	14	8	2.86	0.88

… the same table as CDR3b sequence pairs:

kable(head(c_a@clust$CDR3a), digits = 2)

from_cdr3	to_cdr3	weight	cweight	max_len	max_clen	nweight	ncweight
CSATASGGSYNEQFF	CAWTLSGSSYNEQFF	60	36	15	9	4.00	4.00
CSVDALAGVSETQYF	CASSSETQYF	13	-16	15	9	0.87	-1.78
CSVEKNTEAFF	CSVETGKNTEAFF	41	8	13	7	3.15	1.14
CSVLTVSSYNEQFF	CASSAGTGSSYNEQFF	36	8	16	10	2.25	0.80
CASSVGEATNEKLFF	CASSLEADHEKLFF	43	10	15	9	2.87	1.11
CASSYARTTEAFF	CASSYISMNTEAFF	40	7	14	8	2.86	0.88

3.2.3 Annotation of CDR3s

The function clust_irr performs automatic annotation of TCR clones based on databases (DBs) including: VDJdb, TCR3d, McPAS-TCR. The control parameter control$db_edit=0 (default) controls an edit distance threshold. If the edit distance between an input CDR3 and a DB CDR3 sequence is smaller then or equal to control$db_edit, then the input CDR3s inherits the antigen specificity data of the DB CDR3s.

To access these annotations see:

# control = list(gmi = 0.7, trim_flank_aa = 3, db_dist = 0, db_custom = NULL)
kable(head(get_clustirr_inputs(c_a)$s), digits = 2)

CDR3a	CDR3b	clone_size	sample	id	info_vdjdb_CDR3b	info_mcpas_CDR3b
CASSLRGAHNEQFF	CASTVTSGSNEKLFF	3	a	1	<db:VDJdb\|chain:CDR3b\|Antigen_species:\|Antigen_gene:\|CDR3_species:\|Reference:>	<db:mcpas\|chain:CDR3b\|Antigen_species:\|Antigen_gene:\|CDR3_species:\|Reference:>
CASGLRQGYGYTF	CASSLTGTGYTF	2	a	2	<db:VDJdb\|chain:CDR3b\|Antigen_species:\|Antigen_gene:\|CDR3_species:\|Reference:>	<db:mcpas\|chain:CDR3b\|Antigen_species:\|Antigen_gene:\|CDR3_species:\|Reference:>
CSAGGFRETQYF	CSALTPGLIYNEQFF	3	a	3	<db:VDJdb\|chain:CDR3b\|Antigen_species:\|Antigen_gene:\|CDR3_species:\|Reference:>	<db:mcpas\|chain:CDR3b\|Antigen_species:\|Antigen_gene:\|CDR3_species:\|Reference:>
CGSSLSQGSEQYF	CSARASWGTNEKLFF	2	a	4	<db:VDJdb\|chain:CDR3b\|Antigen_species:\|Antigen_gene:\|CDR3_species:\|Reference:>	<db:mcpas\|chain:CDR3b\|Antigen_species:\|Antigen_gene:\|CDR3_species:\|Reference:>
CSPRGPYGYTF	CASSVREAENQPQHF	3	a	5	<db:VDJdb\|chain:CDR3b\|Antigen_species:\|Antigen_gene:\|CDR3_species:\|Reference:>	<db:mcpas\|chain:CDR3b\|Antigen_species:\|Antigen_gene:\|CDR3_species:\|Reference:>
CSLGPSKSQYF	CASCTVGNQPQHF	5	a	6	<db:VDJdb\|chain:CDR3b\|Antigen_species:\|Antigen_gene:\|CDR3_species:\|Reference:>	<db:mcpas\|chain:CDR3b\|Antigen_species:\|Antigen_gene:\|CDR3_species:\|Reference:>

3.3 Step 2. building a graph with `get_graph` or `get_joint_graph`

Next, ClustIRR builds a graph. If we analyze one IRR, then we may employ the function get_graph, which converts the clust_irr object into an igraph object. Meanwhile, if we are analyzing two ore more IRRs, then we can use the function get_joint_graph to generate a joint igraph object. In this case, edges between TCR clones from different IRRs are computed using the same procedure outlined in step 1.

The graphs have nodes and weighted edges:

nodes: clones from each IRR. Each clone attribute (clone size, CDR3 sequences, etc) is provided as node attribute
edges: connections between nodes (clones) in each IRR (computed in step 1.)

g <- get_graph(clust_irr = c_a)

plot_graph(g, as_visnet = TRUE)

The graph is an igraph object. We can use the igraph functions to inspect different properties of the graph, such as, the distribution of edge weights (shown below). Notice, that the edge weights vary drastically between the edges.

# data.frame of edges and their attributes
e <- igraph::as_data_frame(x = g$graph, what = "edges")

kable(head(e), digits = 2)

from	to	weight	cweight	nweight	ncweight	max_len	max_clen	type	chain	clustering
a\|39	a\|69	53	20	4.08	2.86	13	7	within-repertoire	CDR3b	global
a\|151	a\|211	44	9	2.93	1.00	15	9	within-repertoire	CDR3b	global
a\|211	a\|229	53	18	3.79	2.25	14	8	within-repertoire	CDR3b	global
a\|76	a\|187	54	21	3.60	2.33	15	9	within-repertoire	CDR3b	global
a\|16	a\|69	35	2	2.50	0.25	14	8	within-repertoire	CDR3b	global
a\|16	a\|218	46	13	3.29	1.62	14	8	within-repertoire	CDR3b	global

Below we show the distributions of the edge attributes ncweight and nweight between CDR3$\alpha$ and CDR3$\beta$ sequence pairs in the IRR.

ggplot(data = e)+
  geom_density(aes(ncweight, col = chain))+
  geom_density(aes(nweight, col = chain), linetype = "dashed")+
  theme_bw()+
  xlab(label = "edge weight (solid = ncweight, dashed = nweight)")+
  theme(legend.position = "top")

Here we have two IRRs. We can use the function by get_joint_graph to create a joint graph. This function computes edges between the TCR clones from the different IRRs (as described in step 1.). We do this in the following code blocks.

g <- get_joint_graph(clust_irrs = c(c_a, c_b))

plot_graph(g = g, as_visnet = TRUE, node_opacity = 0.8)

3.4 Step 3. community detection

ClustIRR employs graph-based community detection (GCD) algorithms, such as Louvain or Leiden, to identify densely connected communities.

But first, we must decide how to compute a similarity between two nodes, $i$ and $j$, (e.g. TCR clones) based on the similarity scores between their CDR3 sequences (compute in step 1.). We will refer to this metric as $\omega(i,j)$.

3.4.1 Scenario 1: we have CDR3 sequences from one chain, e.g. CDR3$\beta$

If the data contains CDR3 sequences from only one chain, such as CDR3$\beta$, then $\omega(i,j)$ is defined as \[\begin{align} \omega(i,j)={\bar{\omega}}^\beta \qquad\text{or}\qquad \omega(i,j)={\bar{\omega}}^\beta_c \end{align}\] The user can decide among the two definitions by specifying weight = “ncweight” (default; $\omega(i,j)=\bar{\omega_c}$) or weight = “nweight” (default; $\omega(i,j)=\bar{\omega}$).

3.4.2 Scenario 2: we have CDR3 sequences from both chains (paired data)

To compute the similarity score between TCR clones, $i$ and $j$, we compute the average alignment score from their CDR3$\alpha$ and CDR3$\beta$ alignment scores (in the next, I will use TCR$\alpha\beta$ as an example, however, this approach can also be used to compare TCR$\gamma\delta$ or BCRIgH-IgL clones): \[\begin{align} \omega(i,j)=\dfrac{{\bar{\omega}}^\alpha + {\bar{\omega}}^\beta}{2} \qquad\text{or}\qquad \omega(i,j)=\dfrac{{\bar{\omega}}^\alpha_c + {\bar{\omega}}^\beta_c}{2}, \end{align}\] where $\bar{\omega}^\alpha$ and $\bar{\omega}^\beta$ are the alignment scores for the CDR3$\alpha$ and CDR3$\beta$ sequences, respectively; and $\bar{\omega}^\alpha_c$ and $\bar{\omega}^\beta_c$ are the alignment scores for the CDR3$\alpha$ and CDR3$\beta$ cores, respectively. Based on this metric, the contributions of CDR3$\alpha$ and CDR3$\beta$ towards the overall similarity of the TCR clones are assigned equal weights.

ClustIRR provides two additional metrics for computing similarity scores between TCR clones, including a strict metric, which assigns high similarity score to a pair of TCR clones only if both of their CDR3$\alpha$ and CDR3$\beta$ sequence pairs are similar \[\begin{align} \omega^s(i,j) = \min({\bar{\omega}}^\alpha, {\bar{\omega}}^\beta) \qquad\text{or}\qquad \omega^s(i,j) = \min({\bar{\omega}}^\alpha_c, {\bar{\omega}}^\beta_c), \end{align}\] and a loose metric, which assigns high similarity score to a pair of TCR clones if either of their CDR3$\alpha$ and CDR3$\beta$ sequence pairs are similar \[\begin{align} \omega^l(i,j) = \max({\bar{\omega}}^\alpha, {\bar{\omega}}^\beta) \qquad\text{or}\qquad \omega^l(i,j) = \max({\bar{\omega}}^\alpha_c, {\bar{\omega}}^\beta_c), \end{align}\]

The user has the following options:

algorithm: “leiden” (default) or “louvain”
resolution: GCD resolution = 1 (default)
weight: “ncweight” (default) or “nweight”
metric: “average” (default), “strict” or “loose”
chains: “CDR3a” or “CDR3b” or c(“CDR3a”, “CDR3b”)

gcd <- detect_communities(graph = g$graph, 
                          algorithm = "leiden",
                          resolution = 1,
                          weight = "ncweight",
                          metric = "average",
                          chains = c("CDR3a", "CDR3b"))

The function detect_communities generates a complex output. Lets investigate its elements:

names(gcd)

[1] "community_occupancy_matrix" "community_summary"         
[3] "node_summary"               "graph"                     
[5] "input_config"

The main element is community_occupancy_matrix, which contains the number of T-cells in each community (row) and IRR (column). Here we have two IRRs (two columns) and about 300 communities. This matrix is the main input of the function dco (step 4.), to detect differences in the community occupancy between IRRs.

dim(gcd$community_occupancy_matrix)

[1] 320   2

head(gcd$community_occupancy_matrix)

plot(x = gcd$community_occupancy_matrix[, "a"],
     y = gcd$community_occupancy_matrix[, "b"],
     xlab = "community occupancy [cells in IRR a]",
     ylab = "community occupancy [cells in IRR b]")

Also see community_summary. Here we provide community-specific summaries, as rows, including:

clones_a, clone_b, clones_n: the frequency of clones in the community coming from IRR a, b and in total (n)
cells_a, cells_b, cells_n: the frequency of cell in the community coming from IRR a, b and in total (n)
w: the mean inter-clone similarity ($\omega(i,j)$)
w_CDR3a, w_CDR3b: the contributions of CDR3$\alpha$ and CDR3$\beta$ to w
n_CDR3a, n_CDR3b: number of edges between CDR3$\alpha$ and CDR3$\beta$ sequences

kable(head(gcd$community_summary), digits = 2)

community	clones_a	clones_b	clones_n	cells_a	cells_b	cells_n	w	w_CDR3a	w_CDR3b	n_CDR3a	n_CDR3b
1	2	0	2	7	0	7	1.25	2.50	0.00	1	0
2	1	0	1	2	0	2	0.00	0.00	0.00	0	0
3	1	0	1	3	0	3	0.00	0.00	0.00	0	0
4	1	0	1	2	0	2	0.00	0.00	0.00	0	0
5	2	1	3	13	5	18	2.61	1.74	3.48	1	3
6	1	0	1	5	0	5	0.00	0.00	0.00	0	0

What is the contribution of CDR3a and CDR3b weights to the formation of communities?

Notice the big dot at coordinatex 0,0. Communities made up from a single node have no within-community edges $\rightarrow$

w_CDR3a = 0
w_CDR3b = 0
w = 0

ggplot(data = gcd$community_summary)+
  geom_point(aes(x = w_CDR3a, y = w_CDR3b, size = cells_n), shape=21)+
  xlab(label = "CDR3a ncweight")+
  ylab(label = "CDR3b ncweight")+
  scale_size_continuous(range = c(0.5, 5))+
  theme_bw(base_size = 10)

Node-specific (TCR clone-specific) summaries are provided in node_summary

kable(head(gcd$node_summary), digits = 2)

	name	clone_id	info_mcpas_CDR3b	info_vdjdb_CDR3b	sample	clone_size	CDR3b	CDR3a	community
a\|1	a\|1	1	<db:mcpas\|chain:CDR3b\|Antigen_species:\|Antigen_gene:\|CDR3_species:\|Reference:>	<db:VDJdb\|chain:CDR3b\|Antigen_species:\|Antigen_gene:\|CDR3_species:\|Reference:>	a	3	CASTVTSGSNEKLFF	CASSLRGAHNEQFF	1
a\|2	a\|2	2	<db:mcpas\|chain:CDR3b\|Antigen_species:\|Antigen_gene:\|CDR3_species:\|Reference:>	<db:VDJdb\|chain:CDR3b\|Antigen_species:\|Antigen_gene:\|CDR3_species:\|Reference:>	a	2	CASSLTGTGYTF	CASGLRQGYGYTF	2
a\|3	a\|3	3	<db:mcpas\|chain:CDR3b\|Antigen_species:\|Antigen_gene:\|CDR3_species:\|Reference:>	<db:VDJdb\|chain:CDR3b\|Antigen_species:\|Antigen_gene:\|CDR3_species:\|Reference:>	a	3	CSALTPGLIYNEQFF	CSAGGFRETQYF	3
a\|4	a\|4	4	<db:mcpas\|chain:CDR3b\|Antigen_species:\|Antigen_gene:\|CDR3_species:\|Reference:>	<db:VDJdb\|chain:CDR3b\|Antigen_species:\|Antigen_gene:\|CDR3_species:\|Reference:>	a	2	CSARASWGTNEKLFF	CGSSLSQGSEQYF	4
a\|5	a\|5	5	<db:mcpas\|chain:CDR3b\|Antigen_species:\|Antigen_gene:\|CDR3_species:\|Reference:>	<db:VDJdb\|chain:CDR3b\|Antigen_species:\|Antigen_gene:\|CDR3_species:\|Reference:>	a	3	CASSVREAENQPQHF	CSPRGPYGYTF	5
a\|6	a\|6	6	<db:mcpas\|chain:CDR3b\|Antigen_species:\|Antigen_gene:\|CDR3_species:\|Reference:>	<db:VDJdb\|chain:CDR3b\|Antigen_species:\|Antigen_gene:\|CDR3_species:\|Reference:>	a	5	CASCTVGNQPQHF	CSLGPSKSQYF	6

3.5 Step 4. differential community occupancy (DCO)

Do we see growing or shrinking communities in a given IRRs?

We employ a Bayesian model to quantify the relative abundance (occupancy) of individual communities in each IRR (minimum number of IRRs = 2).

For DCO analysis of two IRRs The model output is the parameter $\delta=\delta_1,\delta_2,\ldots,\delta_k$, where $k$ is the number of communities. Growing community $i$ between IRR $a$ vs. IRR $b$, results in $\delta_i>0$, shrinking community $i$ results in $\delta_i < 0$.

For DCO analysis of more than two IRRs The model output for IRR $a$ is the parameter vector $\beta^a=\beta^a_1,\beta^a_2,\ldots,\beta^a_k$, which describes the effect of IRR $a$ on the relative occupancy in each community.

Given two IRRs, $a$ and $b$, we can quantify the differential community occupancy (DCO): \[\begin{align} \delta^{a-b}_i = \beta^a_i - \beta^b_i \end{align}\]

d <- dco(community_occupancy_matrix = gcd$community_occupancy_matrix,
         mcmc_control = list(mcmc_warmup = 500,
                             mcmc_iter = 1500,
                             mcmc_chains = 3,
                             mcmc_cores = 1,
                             mcmc_algorithm = "NUTS",
                             adapt_delta = 0.9,
                             max_treedepth = 10))


SAMPLING FOR MODEL 'dm_n2' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 0.000229 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 2.29 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:    1 / 1500 [  0%]  (Warmup)
Chain 1: Iteration:  150 / 1500 [ 10%]  (Warmup)
Chain 1: Iteration:  300 / 1500 [ 20%]  (Warmup)
Chain 1: Iteration:  450 / 1500 [ 30%]  (Warmup)
Chain 1: Iteration:  501 / 1500 [ 33%]  (Sampling)
Chain 1: Iteration:  650 / 1500 [ 43%]  (Sampling)
Chain 1: Iteration:  800 / 1500 [ 53%]  (Sampling)
Chain 1: Iteration:  950 / 1500 [ 63%]  (Sampling)
Chain 1: Iteration: 1100 / 1500 [ 73%]  (Sampling)
Chain 1: Iteration: 1250 / 1500 [ 83%]  (Sampling)
Chain 1: Iteration: 1400 / 1500 [ 93%]  (Sampling)
Chain 1: Iteration: 1500 / 1500 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 3.829 seconds (Warm-up)
Chain 1:                6.973 seconds (Sampling)
Chain 1:                10.802 seconds (Total)
Chain 1: 

SAMPLING FOR MODEL 'dm_n2' NOW (CHAIN 2).
Chain 2: 
Chain 2: Gradient evaluation took 0.000189 seconds
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 1.89 seconds.
Chain 2: Adjust your expectations accordingly!
Chain 2: 
Chain 2: 
Chain 2: Iteration:    1 / 1500 [  0%]  (Warmup)
Chain 2: Iteration:  150 / 1500 [ 10%]  (Warmup)
Chain 2: Iteration:  300 / 1500 [ 20%]  (Warmup)
Chain 2: Iteration:  450 / 1500 [ 30%]  (Warmup)
Chain 2: Iteration:  501 / 1500 [ 33%]  (Sampling)
Chain 2: Iteration:  650 / 1500 [ 43%]  (Sampling)
Chain 2: Iteration:  800 / 1500 [ 53%]  (Sampling)
Chain 2: Iteration:  950 / 1500 [ 63%]  (Sampling)
Chain 2: Iteration: 1100 / 1500 [ 73%]  (Sampling)
Chain 2: Iteration: 1250 / 1500 [ 83%]  (Sampling)
Chain 2: Iteration: 1400 / 1500 [ 93%]  (Sampling)
Chain 2: Iteration: 1500 / 1500 [100%]  (Sampling)
Chain 2: 
Chain 2:  Elapsed Time: 3.792 seconds (Warm-up)
Chain 2:                6.946 seconds (Sampling)
Chain 2:                10.738 seconds (Total)
Chain 2: 

SAMPLING FOR MODEL 'dm_n2' NOW (CHAIN 3).
Chain 3: 
Chain 3: Gradient evaluation took 0.000192 seconds
Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 1.92 seconds.
Chain 3: Adjust your expectations accordingly!
Chain 3: 
Chain 3: 
Chain 3: Iteration:    1 / 1500 [  0%]  (Warmup)
Chain 3: Iteration:  150 / 1500 [ 10%]  (Warmup)
Chain 3: Iteration:  300 / 1500 [ 20%]  (Warmup)
Chain 3: Iteration:  450 / 1500 [ 30%]  (Warmup)
Chain 3: Iteration:  501 / 1500 [ 33%]  (Sampling)
Chain 3: Iteration:  650 / 1500 [ 43%]  (Sampling)
Chain 3: Iteration:  800 / 1500 [ 53%]  (Sampling)
Chain 3: Iteration:  950 / 1500 [ 63%]  (Sampling)
Chain 3: Iteration: 1100 / 1500 [ 73%]  (Sampling)
Chain 3: Iteration: 1250 / 1500 [ 83%]  (Sampling)
Chain 3: Iteration: 1400 / 1500 [ 93%]  (Sampling)
Chain 3: Iteration: 1500 / 1500 [100%]  (Sampling)
Chain 3: 
Chain 3:  Elapsed Time: 3.84 seconds (Warm-up)
Chain 3:                6.966 seconds (Sampling)
Chain 3:                10.806 seconds (Total)
Chain 3:

3.6 Step 5. posterior predictive check

Before we can start interpreting the model results, we have to make sure that the model is valid. One standard approach is to check whether our model can retrodict the observed data (community occupancy matrix) which was used to fit model parameters.

General idea of posterior predictive checks:

fit model based on data $y$
simulate new data $\hat{y}$
compare $y$ and $\hat{y}$

ClustIRR provides $y$ and $\hat{y}$ of each IRR, which we can visualize with ggplot:

ggplot(data = d$posterior_summary$y_hat)+
  facet_wrap(facets = ~sample, nrow = 1, scales = "free")+
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", col = "gray")+
  geom_errorbar(aes(x = y_obs, y = mean, ymin = L95, ymax = H95),
                col = "darkgray", width=0)+
  geom_point(aes(x = y_obs, y = mean), size = 0.8)+
  xlab(label = "observed y")+
  ylab(label = "predicted y (and 95% HDI)")+
  theme_bw(base_size = 10)

3.7 Step 6. $\delta$ parameters

We can compare DAC in two directions: $a$ vs. $b$ or $b$ vs. $a$

ggplot(data = d$posterior_summary$delta)+
  facet_wrap(facets = ~contrast, ncol = 1)+
  geom_errorbar(aes(x = community, y = mean, ymin = L95, ymax = H95), 
                col = "darkgray", width =0)+
  geom_point(aes(x = community, y = mean), size = 0.5)+
  theme_bw(base_size = 10)+
  theme(legend.position = "none")+
  ylab(label = expression(delta))+
  scale_x_continuous(expand = c(0,0))

Distribution of mean $\delta$s

ggplot(data = d$posterior_summary$delta)+
  geom_histogram(aes(mean), bins = 100)+
  xlab(label = expression(delta))+
  theme_bw(base_size = 10)

3.8 Conclusion: you can also use custom community occupancy matrix for DCO!

The function dco takes as its main input a community occupancy matrix. This enables users who are accustomed to using complementary algorithm for detecting specificity groups, such as, GLIPH, TCRdist3, GIANA, and iSMART, to skip steps 1-3 of the ClustIRR workflow, and to proceed with analysis for DCO.

4 Case study: analysis of three TCR repertoires

Imagine that we have three IRRs, $a$, $b$ and $c$, obtained from one patient at three timepoints.

# repertoire size
n <- 200

# a
a <- data.frame(CDR3a = CDR3ab$CDR3a[1:n], 
                CDR3b = CDR3ab$CDR3b[1:n],
                sample = "a")
# b
b <- data.frame(CDR3a = CDR3ab$CDR3a[1:n], 
                CDR3b = CDR3ab$CDR3b[1:n],
                sample = "b")

# c
c <- data.frame(CDR3a = CDR3ab$CDR3a[1:n], 
                CDR3b = CDR3ab$CDR3b[1:n],
                sample = "c")

get_clonal_expansion <- function(n, p_expanded) {
  s <- sample(x = c(0, 1), size = n, prob = c(1-p_expanded, 
                                              p_expanded), replace = T)
  y <- vapply(X = s, FUN.VALUE = numeric(1), FUN = function(x) {
    if(x == 0) {
      return(rpois(n = 1, lambda = 0.5))
    }
    return(rpois(n = 1, lambda = 50))
  })
  return(y)
}

# simulate expansion of specific communities
set.seed(1243)
clone_size <- rpois(n = n, lambda = 3)+1
expansion_factor <- get_clonal_expansion(n = n, p_expanded = 0.02)

a$clone_size <- clone_size
b$clone_size <- clone_size+expansion_factor*1
c$clone_size <- clone_size+expansion_factor*2

4.1 Step 1. `cluster_irr` analyzed each TCRs repertoire

# run cluster_irr on each IRR and join the results
clust_irrs <- c(cluster_irr(s = a), cluster_irr(s = b), cluster_irr(s = c))

4.2 Step 2. `get_graph` and `plot_graph` visualize specificity structures

We can also plot a graph of the global specificity structure in TCR repertoire $a$, $b$ and $c$.

g <- get_joint_graph(clust_irrs = clust_irrs, cores = 1)

plot_graph(g = g, as_visnet = TRUE, node_opacity = 0.8)

4.3 Step 3. `detect_communities` identifies communities in the graph

Are there densely connected sets of nodes (=communities) in this graph?

To answer this question we can use graph-based community detection (GCD) algorithms, such as Leiden or Louvain. As input for GCD we can use nweight or ncweight (default) between CDR3a, CDR3b or both CDR3a and CDR3b.

gcd <- detect_communities(graph = g$graph,
                          weight = "ncweight",
                          algorithm = "leiden",
                          resolution = 1,
                          chains = c("CDR3a", "CDR3b"))

How many cells in each community from the three IRRs?

panel A: $a$ vs $b$
panel B: $a$ vs $c$
panel C: $b$ vs $c$

(ggplot(data = gcd$community_summary)+
   geom_point(aes(x = cells_a, y = cells_b), shape = 21)+
   theme_bw(base_size = 10))|
  (ggplot(data = gcd$community_summary)+
     geom_point(aes(x = cells_a, y = cells_c), shape = 21)+
     theme_bw(base_size = 10))|
  (ggplot(data = gcd$community_summary)+
     geom_point(aes(x = cells_b, y = cells_c), shape = 21)+
     theme_bw(base_size = 10))+
  plot_annotation(tag_level = 'A')

The number of cells in each IRR and community are stored as cells in the matrix community_occupancy_matrix. Rows are communities, and columns are IRRs

community_occupancy_matrix <- gcd$community_occupancy_matrix
head(community_occupancy_matrix)

4.4 Step 4. `dco` performs differential community occupancy (DCO) analysis

Do we see growing or shrinking communities in a given IRRs?

We employ a Bayesian model to quantify the relative abundance (occupancy) of individual communities, and the differential community occupancy between IRRs.

d <- dco(community_occupancy_matrix = community_occupancy_matrix,
         mcmc_control = list(mcmc_warmup = 300,
                             mcmc_iter = 900,
                             mcmc_chains = 3,
                             mcmc_cores = 1,
                             mcmc_algorithm = "NUTS",
                             adapt_delta = 0.95,
                             max_treedepth = 11))


SAMPLING FOR MODEL 'dm' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 0.000169 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 1.69 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:   1 / 900 [  0%]  (Warmup)
Chain 1: Iteration:  90 / 900 [ 10%]  (Warmup)
Chain 1: Iteration: 180 / 900 [ 20%]  (Warmup)
Chain 1: Iteration: 270 / 900 [ 30%]  (Warmup)
Chain 1: Iteration: 301 / 900 [ 33%]  (Sampling)
Chain 1: Iteration: 390 / 900 [ 43%]  (Sampling)
Chain 1: Iteration: 480 / 900 [ 53%]  (Sampling)
Chain 1: Iteration: 570 / 900 [ 63%]  (Sampling)
Chain 1: Iteration: 660 / 900 [ 73%]  (Sampling)
Chain 1: Iteration: 750 / 900 [ 83%]  (Sampling)
Chain 1: Iteration: 840 / 900 [ 93%]  (Sampling)
Chain 1: Iteration: 900 / 900 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 7.571 seconds (Warm-up)
Chain 1:                12.053 seconds (Sampling)
Chain 1:                19.624 seconds (Total)
Chain 1: 

SAMPLING FOR MODEL 'dm' NOW (CHAIN 2).
Chain 2: 
Chain 2: Gradient evaluation took 0.000145 seconds
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 1.45 seconds.
Chain 2: Adjust your expectations accordingly!
Chain 2: 
Chain 2: 
Chain 2: Iteration:   1 / 900 [  0%]  (Warmup)
Chain 2: Iteration:  90 / 900 [ 10%]  (Warmup)
Chain 2: Iteration: 180 / 900 [ 20%]  (Warmup)
Chain 2: Iteration: 270 / 900 [ 30%]  (Warmup)
Chain 2: Iteration: 301 / 900 [ 33%]  (Sampling)
Chain 2: Iteration: 390 / 900 [ 43%]  (Sampling)
Chain 2: Iteration: 480 / 900 [ 53%]  (Sampling)
Chain 2: Iteration: 570 / 900 [ 63%]  (Sampling)
Chain 2: Iteration: 660 / 900 [ 73%]  (Sampling)
Chain 2: Iteration: 750 / 900 [ 83%]  (Sampling)
Chain 2: Iteration: 840 / 900 [ 93%]  (Sampling)
Chain 2: Iteration: 900 / 900 [100%]  (Sampling)
Chain 2: 
Chain 2:  Elapsed Time: 7.905 seconds (Warm-up)
Chain 2:                12.155 seconds (Sampling)
Chain 2:                20.06 seconds (Total)
Chain 2: 

SAMPLING FOR MODEL 'dm' NOW (CHAIN 3).
Chain 3: 
Chain 3: Gradient evaluation took 0.000187 seconds
Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 1.87 seconds.
Chain 3: Adjust your expectations accordingly!
Chain 3: 
Chain 3: 
Chain 3: Iteration:   1 / 900 [  0%]  (Warmup)
Chain 3: Iteration:  90 / 900 [ 10%]  (Warmup)
Chain 3: Iteration: 180 / 900 [ 20%]  (Warmup)
Chain 3: Iteration: 270 / 900 [ 30%]  (Warmup)
Chain 3: Iteration: 301 / 900 [ 33%]  (Sampling)
Chain 3: Iteration: 390 / 900 [ 43%]  (Sampling)
Chain 3: Iteration: 480 / 900 [ 53%]  (Sampling)
Chain 3: Iteration: 570 / 900 [ 63%]  (Sampling)
Chain 3: Iteration: 660 / 900 [ 73%]  (Sampling)
Chain 3: Iteration: 750 / 900 [ 83%]  (Sampling)
Chain 3: Iteration: 840 / 900 [ 93%]  (Sampling)
Chain 3: Iteration: 900 / 900 [100%]  (Sampling)
Chain 3: 
Chain 3:  Elapsed Time: 8.713 seconds (Warm-up)
Chain 3:                11.987 seconds (Sampling)
Chain 3:                20.7 seconds (Total)
Chain 3:

4.5 Step 5. posterior predictive check

General idea of posterior predictive checks:

fit model based on data $y$
simulate new data $\hat{y}$
compare $y$ and $\hat{y}$

ClustIRR provides $y$ and $\hat{y}$ of each IRR, which we can visualize with ggplot:

ggplot(data = d$posterior_summary$y_hat)+
  facet_wrap(facets = ~sample, nrow = 1, scales = "free")+
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", col = "gray")+
  geom_errorbar(aes(x = y_obs, y = mean, ymin = L95, ymax = H95),
                col = "darkgray", width=0)+
  geom_point(aes(x = y_obs, y = mean), size = 0.8)+
  xlab(label = "observed y")+
  ylab(label = "predicted y (and 95% HDI)")+
  theme_bw(base_size = 10)

4.6 Step 6. $\beta$ parameters

Notice that some (about 2%) $\beta$ coefficients are far from $\beta=0$.

Remember: we simulated clonal expansion in 2% of the communities!

beta <- d$posterior_summary$beta
ggplot(data = beta)+
  facet_wrap(facets = ~sample, ncol = 1)+
  geom_errorbar(aes(x = community, y = mean, ymin = L95, ymax = H95,
                    col = sample), width = 0)+
  geom_point(aes(x = community, y = mean, col = sample), size = 0.5)+
  theme_bw(base_size = 10)+
  theme(legend.position = "top")+
  ylab(label = expression(beta))+
  scale_x_continuous(expand = c(0,0))

4.7 Step 7. $\delta$ parameters

If a given community $i$ is differentially expressed between two AIRRs, $a$ and $b$, then we may expect to see a difference in the credible values of $\beta^{a}_{i}$ and $\beta^{b}_{i}$. We define this as $\delta^{a-b}_{i}$.

$\delta^{a-b}_{i} = \beta^{a}_{i}-\beta^{b}_{i}$

Lets look at $\delta^{a-b}$, $\delta^{a-c}$ and $\delta^{b-c}$ for different communities. This information is stored in posterior_summary$delta. of the output. We see clear differences ($\delta!=0$) for at least 3 communities.

Remember: we simulated clonal expansion in about 2% of the communities!

delta <- d$posterior_summary$delta
delta <- delta[delta$contrast %in% c("a-b", "a-c", "b-c"), ]

ggplot(data = delta)+
  facet_wrap(facets = ~contrast, ncol = 1)+
  geom_errorbar(aes(x = community, y = mean, ymin = L95, ymax = H95), width=0)+
  geom_point(aes(x = community, y = mean), size = 0.5)+
  theme_bw(base_size = 10)+
  theme(legend.position = "top")+
  ylab(label = expression(delta))+
  scale_x_continuous(expand = c(0,0))

utils::sessionInfo()

R Under development (unstable) (2024-10-21 r87258)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.1 LTS

Matrix products: default
BLAS:   /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB              LC_COLLATE=C              
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] patchwork_1.3.0  ggrepel_0.9.6    ggplot2_3.5.1    igraph_2.1.1    
[5] ClustIRR_1.5.0   knitr_1.48       BiocStyle_2.35.0

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.1        dplyr_1.1.4             farver_2.1.2           
 [4] loo_2.8.0               Biostrings_2.75.0       fastmap_1.2.0          
 [7] digest_0.6.37           lifecycle_1.0.4         blaster_1.0.7          
[10] pwalign_1.3.0           StanHeaders_2.32.10     magrittr_2.0.3         
[13] compiler_4.5.0          rlang_1.1.4             sass_0.4.9             
[16] tools_4.5.0             utf8_1.2.4              yaml_2.3.10            
[19] labeling_0.4.3          htmlwidgets_1.6.4       pkgbuild_1.4.5         
[22] curl_5.2.3              plyr_1.8.9              withr_3.0.2            
[25] BiocGenerics_0.53.0     grid_4.5.0              stats4_4.5.0           
[28] fansi_1.0.6             colorspace_2.1-1        future_1.34.0          
[31] inline_0.3.19           globals_0.16.3          scales_1.3.0           
[34] tinytex_0.53            cli_3.6.3               rmarkdown_2.28         
[37] crayon_1.5.3            generics_0.1.3          RcppParallel_5.1.9     
[40] stringdist_0.9.12       future.apply_1.11.3     httr_1.4.7             
[43] reshape2_1.4.4          visNetwork_2.1.2        cachem_1.1.0           
[46] rstan_2.32.6            stringr_1.5.1           zlibbioc_1.53.0        
[49] parallel_4.5.0          BiocManager_1.30.25     XVector_0.47.0         
[52] matrixStats_1.4.1       vctrs_0.6.5             V8_6.0.0               
[55] jsonlite_1.8.9          bookdown_0.41           IRanges_2.41.0         
[58] S4Vectors_0.45.0        listenv_0.9.1           magick_2.8.5           
[61] jquerylib_0.1.4         glue_1.8.0              parallelly_1.38.0      
[64] codetools_0.2-20        stringi_1.8.4           gtable_0.3.6           
[67] GenomeInfoDb_1.43.0     QuickJSR_1.4.0          UCSC.utils_1.3.0       
[70] munsell_0.5.1           tibble_3.2.1            pillar_1.9.0           
[73] htmltools_0.5.8.1       GenomeInfoDbData_1.2.13 R6_2.5.1               
[76] evaluate_1.0.1          highr_0.11              bslib_0.8.0            
[79] rstantools_2.4.0        Rcpp_1.0.13             gridExtra_2.3          
[82] xfun_0.48               pkgconfig_2.0.3

Analysis of T and B cell receptor repertoires with ClustIRR

29 October 2024

Contents

1 Introduction

2 Installation

3 ClustIRR algorithm

3.1 Input

3.2 Step 1. IRR analysis with `cluster_irr`

3.2.1 Theory: how to compute similarities between IR clones?

3.2.2 Example

3.2.3 Annotation of CDR3s

3.3 Step 2. building a graph with `get_graph` or `get_joint_graph`

3.4 Step 3. community detection

3.4.1 Scenario 1: we have CDR3 sequences from one chain, e.g. CDR3\(\beta\)

3.4.2 Scenario 2: we have CDR3 sequences from both chains (paired data)

3.5 Step 4. differential community occupancy (DCO)

3.6 Step 5. posterior predictive check

3.7 Step 6. \(\delta\) parameters

3.8 Conclusion: you can also use custom community occupancy matrix for DCO!

4 Case study: analysis of three TCR repertoires

4.1 Step 1. `cluster_irr` analyzed each TCRs repertoire

4.2 Step 2. `get_graph` and `plot_graph` visualize specificity structures

4.3 Step 3. `detect_communities` identifies communities in the graph

4.4 Step 4. `dco` performs differential community occupancy (DCO) analysis

4.5 Step 5. posterior predictive check

4.6 Step 6. \(\beta\) parameters

4.7 Step 7. \(\delta\) parameters

Analysis of T and B cell receptor repertoires with ClustIRR

29 October 2024

Contents

1 Introduction

2 Installation

3 ClustIRR algorithm

3.1 Input

3.2 Step 1. IRR analysis with cluster_irr

3.2.1 Theory: how to compute similarities between IR clones?

3.2.2 Example

3.2.3 Annotation of CDR3s

3.3 Step 2. building a graph with get_graph or get_joint_graph

3.4 Step 3. community detection

3.4.1 Scenario 1: we have CDR3 sequences from one chain, e.g. CDR3\(\beta\)

3.4.2 Scenario 2: we have CDR3 sequences from both chains (paired data)

3.5 Step 4. differential community occupancy (DCO)

3.6 Step 5. posterior predictive check

3.7 Step 6. \(\delta\) parameters

3.8 Conclusion: you can also use custom community occupancy matrix for DCO!

4 Case study: analysis of three TCR repertoires

4.1 Step 1. cluster_irr analyzed each TCRs repertoire

4.2 Step 2. get_graph and plot_graph visualize specificity structures

4.3 Step 3. detect_communities identifies communities in the graph

4.4 Step 4. dco performs differential community occupancy (DCO) analysis

4.5 Step 5. posterior predictive check

4.6 Step 6. \(\beta\) parameters

4.7 Step 7. \(\delta\) parameters

3.2 Step 1. IRR analysis with `cluster_irr`

3.3 Step 2. building a graph with `get_graph` or `get_joint_graph`

4.1 Step 1. `cluster_irr` analyzed each TCRs repertoire

4.2 Step 2. `get_graph` and `plot_graph` visualize specificity structures

4.3 Step 3. `detect_communities` identifies communities in the graph

4.4 Step 4. `dco` performs differential community occupancy (DCO) analysis