tripr 1.0.0
tripr is a Bioconductor package, written in shiny, that provides analytics services on antigen receptor (B cell receptor immunoglobulin, BcR IG, or T cell receptor, TR) gene sequence data. Every step of the analysis can be performed interactively, so no programming skills are required. It takes as input the output files of the IMGT/HighV-Quest tool. Users can analyze the data from each input sample separately, or the combined data files from all samples, and visualize the results accordingly. Functions for use from the R command line are also available.
tripr is distributed as a Bioconductor package and requires R (version “4.1”), which can be installed on any operating system from CRAN, and Bioconductor (version “3.14”).
To install the tripr package, enter the following commands in your R session:
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install("tripr")
## Check that you have a valid Bioconductor installation
BiocManager::valid()
Once tripr is successfully installed, it can be loaded as follows:
library(tripr)
tripr as a shiny application
To start the shiny app, run the following command:
tripr::run_app()
tripr should open in a browser (ideally Chrome, Firefox or Opera). If this does not happen automatically, open a browser and navigate to the address shown in the R console (for example, Listening on http://127.0.0.1:6134).
In the Home tab, users can import their data by pressing the Choose directory button and selecting the directory where the data is stored. The tool takes as input the 10 output files of the IMGT/HighV-Quest tool in text format (.txt). Users can also choose only some of the files, depending on the type of downstream analysis.
Note that every sample of the dataset must have its own individual folder and every sample folder must be in one root folder (see example below). To upload the dataset, select this root folder and then press the Load Data button.
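The folder layout described above can be checked before loading the data. The following is a minimal sketch in base R, not part of tripr: the helper check_dataset_layout() and the sample folder names are hypothetical, and only two of the IMGT/HighV-Quest file names are used for brevity.

```r
# Hypothetical helper (not a tripr function): for every sample folder directly
# under `root`, report which of the expected files it contains.
check_dataset_layout <- function(root, expected_files) {
  sample_dirs <- list.dirs(root, full.names = TRUE, recursive = FALSE)
  if (length(sample_dirs) == 0) {
    stop("no sample folders found under ", root)
  }
  rows <- lapply(sample_dirs, function(d) expected_files %in% list.files(d))
  present <- do.call(rbind, rows)
  dimnames(present) <- list(basename(sample_dirs), expected_files)
  present
}

# Build a toy dataset in a temporary directory: one root, two sample folders.
root <- file.path(tempdir(), "dataset_demo")
files <- c("1_Summary.txt", "6_Junction.txt")
for (s in c("sample_A", "sample_B")) {
  dir.create(file.path(root, s), recursive = TRUE, showWarnings = FALSE)
  file.create(file.path(root, s, files))
}
# Simulate a sample with a missing file.
file.remove(file.path(root, "sample_B", "6_Junction.txt"))

layout <- check_dataset_layout(root, files)
print(layout)
```

Each row of the resulting logical matrix is a sample folder and each column an expected file, so a FALSE entry points to a file that is missing from that sample.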
Previous sessions can also be loaded with the Restore Previous Sessions button.
There are 2 options regarding the cell type (T cell and B cell) as well as 2 options based on the amount of available data (High- or Low-Throughput). The main difference between the latter two lies in how the preselection and selection steps are applied. For High-Throughput data, all filters are applied sequentially (i.e. if a sequence fails more than one criterion, only the first unsatisfied criterion is reported), whereas for Low-Throughput data all criteria are applied at the same time.
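The difference between the two modes can be illustrated with a small base R sketch. This is an illustration under assumptions, not tripr's implementation: the criterion names crit1 to crit3 are placeholders, and reporting every failed criterion in the Low-Throughput case is an interpretation of "applied at the same time".

```r
# Toy data: one row per sequence, one logical column per criterion
# (TRUE = the sequence satisfies the criterion). Names are placeholders.
checks <- data.frame(
  crit1 = c(TRUE, FALSE, FALSE, TRUE),
  crit2 = c(TRUE, FALSE, TRUE,  FALSE),
  crit3 = c(TRUE, TRUE,  FALSE, FALSE),
  row.names = paste0("seq", 1:4)
)
failed <- !as.matrix(checks)

# Sequential application: report only the first unsatisfied criterion.
first_failed <- apply(failed, 1, function(x) {
  if (any(x)) names(x)[which(x)[1]] else NA_character_
})

# Simultaneous application: report every unsatisfied criterion.
all_failed <- apply(failed, 1, function(x) paste(names(x)[x], collapse = ","))

print(data.frame(first_failed, all_failed))
```

In both modes the same sequences are discarded; only the reported reason differs.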
tripr offers 2 steps of preprocessing:
Preselection: Refers to the cleaning process of the input dataset.
Selection: Refers to the filtering of the data that result from the Preselection process.
The Preselection process comprises 4 different criteria:
The execution starts when the Apply button is pressed.
Users can visualize the results of the preselection (first cleaning) process in the Preselection tab. In the case of multi-sample datasets, results are provided for each individual sample separately, or for the combined dataset by scrolling through the Select Dataset option.
The output consists of 4 table files:
The figure below shows an example Clean table from this Tab.
All 4 tables can be downloaded as text files.
The sequences that passed through the Preselection process (“Clean table”) are used as input for the data Selection (filtering) process.
This step comprises 6 different filters:
Using the above 3 filters the user can select for sequences that carry one or more particular V, J and D genes or gene alleles, respectively. Different genes/gene alleles should be separated with a vertical line (|), e.g. TRBV11-2|TRBV29-1*03.
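As a rough illustration of how such a filter string could be interpreted, the base R sketch below splits it on the vertical line and keeps sequences whose gene, or exact gene allele, is listed. The matching rule is an assumption made for this example and may differ from what tripr does internally. Because | and * are regular expression metacharacters, the sketch uses fixed splitting and exact comparison instead of pattern matching.

```r
# Gene/allele assignments of four toy sequences.
genes <- c("TRBV11-2*01", "TRBV29-1*03", "TRBV29-1*01", "TRBV5-1*01")

# Filter string as typed by the user.
filter_string <- "TRBV11-2|TRBV29-1*03"
wanted <- strsplit(filter_string, "|", fixed = TRUE)[[1]]

# Gene name without the allele suffix (everything before "*").
gene_part <- sub("\\*.*$", "", genes)

# Keep a sequence if its full allele or its gene name was requested.
keep <- genes %in% wanted | gene_part %in% wanted
print(genes[keep])
```

Here TRBV11-2 is requested at the gene level, so any of its alleles passes, whereas TRBV29-1 is requested only as allele *03.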
The execution starts when the Execute button is pressed.
The results of the Selection (filtering) process are presented in the Selection tab.
This process provides 4 output files:
All the tables can be downloaded as text files.
Users can select the workflow that they want to apply to their dataset(s).
There are 11 different tools in the pipeline tab. 7 of them can be applied for both T- and B-cells, while the remaining 4 can be applied only for B-cells.
Step Dependencies in Pipeline
For both T- and B-cells:
The frequencies for all unique clonotypes of each sample are computed. There are 10 different options for clonotype definition.
The results are presented in the Clonotypes tab in the form of a table, where the clonotype, the count, the frequency and the convergent evolution (if feasible) are given. Each clonotype is also a link that provides a table with all relevant immunogenetic data for that particular clonotype, based on the uploaded files. This table consists of all reads/sequences assigned to that clonotype and all relevant information. Each clonotype is given a unique cluster identifier (cluster ID).
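A clonotype table of this kind can be sketched in a few lines of base R. The example assumes the "V Gene + CDR3 Amino Acids" clonotype definition; the column names and toy sequences are invented for illustration, and the frequency is expressed here as a proportion.

```r
# Toy reads: V gene and CDR3 amino acid sequence of each read.
reads <- data.frame(
  v_gene  = c("IGHV1-69", "IGHV1-69", "IGHV3-23", "IGHV1-69", "IGHV3-23"),
  cdr3_aa = c("ARDYW", "ARDYW", "AKGGW", "ARDYW", "ARSSW"),
  stringsAsFactors = FALSE
)

# A clonotype is the combination of V gene and CDR3 amino acid sequence.
key <- paste(reads$v_gene, reads$cdr3_aa, sep = " - ")
counts <- sort(table(key), decreasing = TRUE)

clonotypes <- data.frame(
  cluster_id = seq_along(counts),       # unique identifier per clonotype
  clonotype  = names(counts),
  count      = as.integer(counts),
  frequency  = as.integer(counts) / length(key),
  stringsAsFactors = FALSE
)
print(clonotypes)
```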
Frequencies for all highly similar clonotypes are computed. The user can set the number of mismatches allowed for each CDR3 length found in the dataset and a clonotype frequency threshold (range: 0-1). Only clonotypes with a frequency above the applied threshold will be used in the subsequent grouping. The whole process can be performed with or without taking into account the rearranged V-gene.
The results are presented in the Highly Similar Clonotypes tab as a table. A second table is also provided containing information regarding the clonotype grouping.
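One possible way to group highly similar clonotypes is sketched below. It is a greedy scheme chosen for illustration and is not necessarily the algorithm tripr uses: clonotypes are visited from most to least frequent, each ungrouped clonotype seeds a new group, and less frequent clonotypes of the same CDR3 length join it if they differ in no more than the allowed number of positions. The functions hamming() and group_similar() are hypothetical, and the V gene is ignored.

```r
# Number of differing positions between two equal-length strings.
hamming <- function(a, b) {
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

# max_mismatch: named vector, names are CDR3 lengths, values allowed mismatches.
group_similar <- function(cdr3, freq, max_mismatch, min_freq = 0) {
  keep <- freq > min_freq               # only clonotypes above the threshold
  cdr3 <- cdr3[keep]
  freq <- freq[keep]
  ord <- order(freq, decreasing = TRUE)
  cdr3 <- cdr3[ord]
  freq <- freq[ord]
  group <- rep(NA_integer_, length(cdr3))
  next_id <- 0L
  for (i in seq_along(cdr3)) {
    if (!is.na(group[i])) next          # already assigned to a group
    next_id <- next_id + 1L
    group[i] <- next_id
    len_key <- as.character(nchar(cdr3[i]))
    allowed <- if (len_key %in% names(max_mismatch)) max_mismatch[[len_key]] else 0
    for (j in seq_along(cdr3)) {
      if (is.na(group[j]) && nchar(cdr3[j]) == nchar(cdr3[i]) &&
          hamming(cdr3[i], cdr3[j]) <= allowed) {
        group[j] <- next_id
      }
    }
  }
  data.frame(cdr3, freq, group, stringsAsFactors = FALSE)
}

res <- group_similar(
  cdr3 = c("ARDYW", "ARDFW", "AKGGW", "ARDYYGW"),
  freq = c(0.5, 0.2, 0.2, 0.1),
  max_mismatch = c("5" = 1, "7" = 2)
)
print(res)
```

With one mismatch allowed for length 5, ARDFW joins the group seeded by ARDYW, while AKGGW (three differences) and the length-7 sequence form their own groups.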
The number of clonotypes using each V, J or D gene/allele is computed over the total number of clonotypes based on the clonotype definition given in the previous Clonotype computation step. If multiple samples are analyzed together the tool provides a total repertoire as well as the repertoire for each individual sample.
Results are provided in the Repertoires tab as tables. Each table includes the gene/allele and information concerning the absolute count and frequency of sequences expressing that particular gene/allele.
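The repertoire computation can be sketched as follows, assuming a clonotype table with one row per clonotype (column names and values are invented for this example). Each clonotype contributes once to its gene, regardless of how many reads it contains.

```r
# Toy clonotype table: one row per clonotype.
clonotypes <- data.frame(
  v_gene  = c("IGHV1-69", "IGHV1-69", "IGHV3-23", "IGHV4-34"),
  cdr3_aa = c("ARDYW", "ARGGW", "AKGGW", "ARSSW"),
  count   = c(30, 5, 10, 5),            # reads per clonotype, not used below
  stringsAsFactors = FALSE
)

# Number of clonotypes per V gene over the total number of clonotypes.
n_per_gene <- table(clonotypes$v_gene)
repertoire <- data.frame(
  gene         = names(n_per_gene),
  n_clonotypes = as.integer(n_per_gene),
  frequency    = as.integer(n_per_gene) / nrow(clonotypes),
  stringsAsFactors = FALSE
)
print(repertoire)
```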
Same as above except for the fact that the tool uses as input the clonotypes as computed in the Highly Similar Clonotypes computation.
The tool performs cross-tabulation analysis between 2 selected variables. Many different variables can be selected by the user for this type of analysis depending on the selected input files from the Home tab.
The results are presented in the Multiple value comparison tab as tables. Each table contains the values that were found to be associated and the relevant frequency.
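A cross-tabulation of two variables can be reproduced in base R with table(). The sketch below uses V gene and J gene as the two variables; the data and column names are invented for illustration.

```r
# Toy sequences with two categorical variables.
seqs <- data.frame(
  v_gene = c("IGHV1-69", "IGHV1-69", "IGHV3-23", "IGHV3-23", "IGHV1-69"),
  j_gene = c("IGHJ4", "IGHJ4", "IGHJ6", "IGHJ4", "IGHJ6"),
  stringsAsFactors = FALSE
)

# Counts of every V/J combination, then the same as proportions of all rows.
tab <- table(v_gene = seqs$v_gene, j_gene = seqs$j_gene)
freq <- prop.table(tab)

# Long format: one row per associated pair of values with its frequency.
pairs <- as.data.frame(freq, responseName = "frequency")
print(pairs[pairs$frequency > 0, ])
```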
This tool can be applied for datasets that consist of sequences with highly similar CDR3. The tool is able to align and create sequence logos for sequences with the same length as well as for sequences that differ by a single amino acid in terms of length.
This tool creates an amino acid frequency table for the selected sequence region (CDR3, VDJ REGION, VJ REGION) of a given length. The frequency table is computed by counting how often each of the 20 amino acids appears at each position of the sequence. Users can choose between the frequency table for all sequences and the table for the top clusters ranked by clonotype frequency.
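A position-wise amino acid frequency table of this kind can be sketched in base R as follows. The sequences are invented, and the example works on CDR3 sequences of length 5 only.

```r
# The 20 standard amino acids (one-letter codes).
aa <- strsplit("ACDEFGHIKLMNPQRSTVWY", "")[[1]]

# Toy CDR3 sequences; keep only those of the requested length.
cdr3 <- c("ARDYW", "ARDFW", "AKDYW", "ARGYW", "ARDYYGW")
target_length <- 5
cdr3 <- cdr3[nchar(cdr3) == target_length]

# Character matrix: one row per sequence, one column per position.
chars <- do.call(rbind, strsplit(cdr3, ""))

# For each position, the proportion of sequences carrying each amino acid.
freq <- vapply(
  seq_len(ncol(chars)),
  function(p) as.numeric(table(factor(chars[, p], levels = aa))) / nrow(chars),
  numeric(length(aa))
)
dimnames(freq) <- list(aa, paste0("pos", seq_len(ncol(chars))))

# Show only amino acids that occur at least once.
print(freq[rowSums(freq) > 0, ])
```

Each column of the table sums to 1, which is the form a sequence logo is typically drawn from.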
A logo is created using the above frequency table. The color code of the amino acids is created based on the 11 IMGT amino acid physicochemical classes.
Only for B cells:
Input sequences are grouped into different categories based on the V-region identity percent. The user can determine the number and the identity percent range of the mutational groups. Each group contains the sequences whose identity is greater than or equal to its low limit and strictly less than its high limit.
The relative frequency of each germline identity group is computed. If the user has not defined any groups based on the somatic hypermutation (SHM) status using the Insert identity groups tool, the tool will group together only sequences that display the exact SHM status (e.g. sequences with an identity percent of 98.6% will be grouped together whereas sequences with 98.7% identity will form a distinct group). Relative frequencies for each SHM group will be computed based on the total number of sequences.
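Both behaviours can be sketched in base R. The group limits below are made up for the example, and assign_group() is a hypothetical helper; the only rule taken from the text is that the low limit is inclusive and the high limit exclusive.

```r
# V-region identity percent of six toy sequences.
identity <- c(100, 99.2, 98.6, 98.7, 97.0, 92.5)

# Hypothetical user-defined groups: low limit inclusive, high limit exclusive.
low  <- c(85, 97, 99, 100)
high <- c(97, 99, 100, 101)

assign_group <- function(identity, low, high) {
  idx <- rep(NA_integer_, length(identity))
  for (g in seq_along(low)) {
    idx[identity >= low[g] & identity < high[g]] <- g
  }
  idx
}

idx <- assign_group(identity, low, high)

# Relative frequency of each group over the total number of sequences.
rel_grouped <- table(factor(idx, levels = seq_along(low))) / length(identity)
print(rel_grouped)

# Without user-defined groups, only identical identity values are pooled,
# so 98.6 and 98.7 end up in distinct groups.
rel_exact <- table(identity) / length(identity)
print(rel_exact)
```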
An alignment table is created for the user-selected region (VDJ REGION, VJ REGION). Sequences that are identical at the amino acid or nucleotide level are grouped together in order to create the grouped alignment table. Alignments for the selected region can be provided at the nucleotide level, the amino acid level, or both. Default reference sequences are extracted from the IMGT reference directory. Reference sequences can be used either at the gene or gene allele level. At the gene level, allele *01 is considered as reference. Users can also submit their own reference sequence. It is also possible to align only a number of selected clonotypes through the Select topN clonotype option, or to select those clonotypes that have an individual frequency above a given percent cutoff.
Results are presented in the Alignment tab as tables.
Each table can be downloaded in txt format.
A table with all somatic hypermutations for all samples together as well as for each individual sample is computed based on the alignment table provided by the previous tool.
The output table includes:
It is possible to analyze only a subset of clonotypes by choosing the Select topN clonotypes or the Select threshold for clonotypes option, or to analyze particular clonotypes by choosing the Select clonotypes separately option. Different clonotype/cluster identifiers (cluster IDs) should be separated by commas (e.g. 1,3,7).
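The three ways of restricting the analysis can be pictured with a short base R sketch. The clonotype table, its column names and the cutoff values are assumptions made for illustration and do not reflect tripr's internal data structures.

```r
# Toy clonotype table with cluster IDs and frequencies in percent.
clonotypes <- data.frame(
  cluster_id    = 1:8,
  frequency_pct = c(30, 20, 15, 10, 10, 8, 5, 2)
)

# Top N clonotypes by frequency.
ranked <- clonotypes[order(clonotypes$frequency_pct, decreasing = TRUE), ]
top_n <- head(ranked, 3)

# Clonotypes with a frequency above a percent cutoff.
above_cutoff <- clonotypes[clonotypes$frequency_pct > 9, ]

# Individually selected clonotypes, given as comma-separated cluster IDs.
ids <- as.integer(trimws(strsplit("1,3,7", ",", fixed = TRUE)[[1]]))
picked <- clonotypes[clonotypes$cluster_id %in% ids, ]

print(picked)
```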
Results are given in the Mutations tab as tables. When different clonotypes are selected separately, different tables are created for each given clonotype.
Each table can be downloaded in text format.
In the Visualization tab, different types of charts (scatter plots, bar charts, etc.) are available for the visualization of the analysis results. Clonotypes are presented as bars and the user can select the frequency above which the clonotypes will be presented.
Convergent evolution can also be visualized, with more than one chart type available.
The computed repertoires are presented as pie-charts and the user can again select the minimum frequency of the gene/allele that will be presented.
Regarding the Multiple value comparison tool, a plot of the 2 selected variables is presented.
All the tables that are presented to the user can be downloaded in text format, whereas the plots and the graphics can be downloaded in .png format.
This section provides an overview of the user’s total options for the analysis.
tripr via R command line
As mentioned before, tripr can also be used via the R command line with the run_TRIP() function.
run_TRIP() works as a wrapper function for the analysis that tripr provides. To see its detailed documentation, run:
?tripr::run_TRIP
Some of its most important arguments:
datapath: The path to the directory where the data is located. Note that every sample of the dataset must have its own individual folder and every sample folder must be in one root folder. Note that every file in the root folder will be used in the analysis. For example, if the dataset is in the user’s Documents folder, one could use fs::path_home("Documents", "dataset"), with the help of the fs package. The default value is fs::path_package("extdata", "dataset", package = "tripr"), which uses the example dataset of 2 B-cell samples.
output_path: The directory where the output data will be stored. Please provide a valid path, ideally built the same way as datapath, by using the fs package. The default value points to the Documents/tripr_output directory.
filelist: The character vector of IMGT/HighV-Quest output files that will be used in the analysis. The default value is c("1_Summary.txt", "2_IMGT-gapped-nt-sequences.txt", "4_IMGT-gapped-AA-sequences.txt", "6_Junction.txt"), which uses only 4 of the 10 .txt files that the IMGT/HighV-Quest tool provides as output.
preselection: Preselection options (1:4). See Preselection.
selection: Selection options (5:10). See Selection.
pipeline: Pipeline options (1:19). The user can select multiple pipelines by separating them with a comma (‘,’). See Pipelines and run ?tripr::run_TRIP for more details.
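Since several of these arguments are comma-separated strings of option numbers, it can be convenient to build them from numeric vectors. The helper below, build_option_string(), is hypothetical and not part of tripr; it only checks the numbers against the ranges stated above and joins them with commas. It does not cover preselection values that carry an extra suffix.

```r
# Hypothetical helper: validate option numbers and join them with commas.
build_option_string <- function(ids, allowed) {
  ids <- as.integer(ids)
  if (!all(ids %in% allowed)) {
    stop("option numbers must be within ", min(allowed), ":", max(allowed))
  }
  paste(sort(unique(ids)), collapse = ",")
}

pipeline  <- build_option_string(c(3, 1, 2), allowed = 1:19)
selection <- build_option_string(5, allowed = 5:10)

print(pipeline)
print(selection)
```

The resulting strings can then be passed as the pipeline and selection arguments of run_TRIP().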
Every output of a tripr analysis run with the run_TRIP() function is stored in the output_path directory, as mentioned before. Therefore, no table or plot is presented through RStudio or any other graphics device while the analysis runs, in contrast to the shiny app, where the user has access to output tables and plots via the User Interface.
Output Directory contains two folders:
The output directory has a unique name for every analysis, derived from the system time at which the analysis was run.
run_TRIP()
An example run_TRIP() analysis, using the provided example dataset of 2 B-cell samples, is presented below.
# Example dataset of 2 B-cell samples shipped with the package
datapath <- fs::path_package("extdata/dataset", package = "tripr")
# Store the results in a temporary directory
output_path <- file.path(tempdir(), "myoutput")
cell <- "Bcell"
throughput <- "High Throughput"
# IMGT/HighV-Quest output files to use in the analysis
filelist <- c(
    "1_Summary.txt",
    "2_IMGT-gapped-nt-sequences.txt",
    "4_IMGT-gapped-AA-sequences.txt",
    "6_Junction.txt"
)
preselection <- "1,2,3,4C:W"
selection <- "5"
identity_range <- "88:100"
pipeline <- "1"
select_clonotype <- "V Gene + CDR3 Amino Acids"

run_TRIP(
    datapath = datapath,
    output_path = output_path,
    filelist = filelist,
    cell = cell,
    throughput = throughput,
    preselection = preselection,
    selection = selection,
    identity_range = identity_range,
    pipeline = pipeline,
    select_clonotype = select_clonotype
)
#> png
#> 2
The tripr package was made possible thanks to:
We hope that tripr will be useful for your research. Please use the following information to cite the package and the research article. Thank you!
## Citation info
citation("tripr")
#>
#> To cite tripr in publications use:
#>
#> Kotouza, M.T., Gemenetzi, K., Galigalidou, C. et al. TRIP - T cell
#> receptor/immunoglobulin profiler. BMC Bioinformatics 21, 422 (2020).
#> https://doi.org/10.1186/s12859-020-03669-1
#>
#> A BibTeX entry for LaTeX users is
#>
#> @Article{,
#> title = {T-cell Receptor/Immunoglobulin Profiler (TRIP)},
#> author = {Maria Th. Kotouza and Katerina Gemenetzi and Chrysi Galigalidou and Elisavet Vlachonikola and Nikolaos Pechlivanis and Andreas Agathangelidis and Raphael Sandaltzopoulos and Pericles A. Mitkas and Kostas Stamatopoulos and Anastasia Chatzidimitriou and Fotis E. Psomopoulos},
#> journal = {BMC Bioinformatics},
#> year = {2020},
#> volume = {21},
#> number = {422},
#> pages = {-},
#> url = {https://doi.org/10.1186/s12859-020-03669-1},
#> }
Here is the output of sessionInfo() on the system on which this document was compiled running pandoc 2.5:
#> R version 4.1.1 (2021-08-10)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] tripr_1.0.0 shinyBS_0.61 shiny_1.7.1 RefManageR_1.3.0
#> [5] BiocStyle_2.22.0
#>
#> loaded via a namespace (and not attached):
#> [1] httr_1.4.2 shinyFiles_0.9.0 tidyr_1.1.4
#> [4] sass_0.4.0 pkgload_1.2.3 viridisLite_0.4.0
#> [7] jsonlite_1.7.2 bslib_0.3.1 assertthat_0.2.1
#> [10] BiocManager_1.30.16 highr_0.9 yaml_2.2.1
#> [13] remotes_2.4.1 pillar_1.6.4 glue_1.4.2
#> [16] digest_0.6.28 pryr_0.1.5 RColorBrewer_1.1-2
#> [19] promises_1.2.0.1 colorspace_2.0-2 htmltools_0.5.2
#> [22] httpuv_1.6.3 plyr_1.8.6 pkgconfig_2.0.3
#> [25] misc3d_0.9-1 bookdown_0.24 config_0.3.1
#> [28] purrr_0.3.4 xtable_1.8-4 scales_1.1.1
#> [31] processx_3.5.2 later_1.3.0 tibble_3.1.5
#> [34] generics_0.1.1 ggplot2_3.3.5 usethis_2.1.2
#> [37] ellipsis_0.3.2 shinyjs_2.0.0 withr_2.4.2
#> [40] lazyeval_0.2.2 cli_3.0.1 magrittr_2.0.1
#> [43] crayon_1.4.1 mime_0.12 evaluate_0.14
#> [46] ps_1.6.0 golem_0.3.1 fs_1.5.0
#> [49] dockerfiler_0.1.4 fansi_0.5.0 xml2_1.3.2
#> [52] pkgbuild_1.2.0 tools_4.1.1 data.table_1.14.2
#> [55] prettyunits_1.1.1 lifecycle_1.0.1 stringr_1.4.0
#> [58] plotly_4.10.0 munsell_0.5.0 callr_3.7.0
#> [61] compiler_4.1.1 jquerylib_0.1.4 rlang_0.4.12
#> [64] plot3D_1.4 grid_4.1.1 attempt_0.3.1
#> [67] rstudioapi_0.13 htmlwidgets_1.5.4 tcltk_4.1.1
#> [70] rmarkdown_2.11 testthat_3.1.0 codetools_0.2-18
#> [73] gtable_0.3.0 DBI_1.1.1 roxygen2_7.1.2
#> [76] R6_2.5.1 gridExtra_2.3 lubridate_1.8.0
#> [79] knitr_1.36 dplyr_1.0.7 fastmap_1.1.0
#> [82] utf8_1.2.2 rprojroot_2.0.2 desc_1.4.0
#> [85] stringi_1.7.5 parallel_4.1.1 Rcpp_1.0.7
#> [88] vctrs_0.3.8 tidyselect_1.1.1 xfun_0.27
This vignette was generated using BiocStyle (Oleś, 2021), knitr (Xie, 2014) and rmarkdown (Allaire, Xie, McPherson, et al., 2021) running behind the scenes.
Citations made with RefManageR (McLean, 2017).
[1] J. Allaire, Y. Xie, J. McPherson, et al. rmarkdown: Dynamic Documents for R. R package version 2.11. 2021. URL: https://github.com/rstudio/rmarkdown.
[2] D. Attali. shinyjs: Easily Improve the User Experience of Your Shiny Apps in Seconds. R package version 2.0.0. 2020. URL: https://CRAN.R-project.org/package=shinyjs.
[3] B. Auguie. gridExtra: Miscellaneous Functions for “Grid” Graphics. R package version 2.3. 2017. URL: https://CRAN.R-project.org/package=gridExtra.
[4] E. Bailey. shinyBS: Twitter Bootstrap Components for Shiny. R package version 0.61. 2015. URL: https://CRAN.R-project.org/package=shinyBS.
[5] W. Chang, J. Cheng, J. Allaire, et al. shiny: Web Application Framework for R. R package version 1.7.1. 2021. URL: https://CRAN.R-project.org/package=shiny.
[6] L. Collado-Torres. Automate package and project setup for Bioconductor packages. https://github.com/lcolladotor/biocthis - R package version 1.4.0. 2021. DOI: 10.18129/B9.bioc.biocthis. URL: http://www.bioconductor.org/packages/biocthis.
[7] M. Dowle and A. Srinivasan. data.table: Extension of ‘data.frame’. R package version 1.14.2. 2021. URL: https://CRAN.R-project.org/package=data.table.
[8] C. Fay, V. Guyader, S. Rochette, et al. golem: A Framework for Robust Shiny Applications. R package version 0.3.1. 2021. URL: https://CRAN.R-project.org/package=golem.
[9] J. Hester and H. Wickham. fs: Cross-Platform File System Operations Based on ‘libuv’. R package version 1.5.0. 2020. URL: https://CRAN.R-project.org/package=fs.
[10] M. W. McLean. “RefManageR: Import and Manage BibTeX and BibLaTeX References in R”. In: The Journal of Open Source Software (2017). DOI: 10.21105/joss.00338.
[11] E. Neuwirth. RColorBrewer: ColorBrewer Palettes. R package version 1.1-2. 2014. URL: https://CRAN.R-project.org/package=RColorBrewer.
[12] A. Oleś. BiocStyle: Standard styles for vignettes and other Bioconductor documents. R package version 2.22.0. 2021. URL: https://github.com/Bioconductor/BiocStyle.
[13] T. Pedersen, V. Nijs, T. Schaffner, et al. shinyFiles: A Server-Side File System Viewer for Shiny. R package version 0.9.0. 2020. URL: https://CRAN.R-project.org/package=shinyFiles.
[14] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria, 2021. URL: https://www.R-project.org/.
[15] C. Sievert. Interactive Web-Based Data Visualization with R, plotly, and shiny. Chapman and Hall/CRC, 2020. ISBN: 9781138331457. URL: https://plotly-r.com.
[16] K. Soetaert. plot3D: Plotting Multi-Dimensional Data. R package version 1.4. 2021. URL: https://CRAN.R-project.org/package=plot3D.
[17] H. Wickham. “The Split-Apply-Combine Strategy for Data Analysis”. In: Journal of Statistical Software 40.1 (2011), pp. 1–29. URL: http://www.jstatsoft.org/v40/i01/.
[18] H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016. ISBN: 978-3-319-24277-4. URL: https://ggplot2.tidyverse.org.
[19] H. Wickham. pryr: Tools for Computing on the Language. R package version 0.1.5. 2021. URL: https://CRAN.R-project.org/package=pryr.
[20] H. Wickham. stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.4.0. 2019. URL: https://CRAN.R-project.org/package=stringr.
[21] H. Wickham. “testthat: Get Started with Testing”. In: The R Journal 3 (2011), pp. 5–10. URL: https://journal.r-project.org/archive/2011-1/RJournal_2011-1_Wickham.pdf.
[22] H. Wickham, R. François, L. Henry, et al. dplyr: A Grammar of Data Manipulation. R package version 1.0.7. 2021. URL: https://CRAN.R-project.org/package=dplyr.
[23] Y. Xie. “knitr: A Comprehensive Tool for Reproducible Research in R”. In: Implementing Reproducible Computational Research. Ed. by V. Stodden, F. Leisch and R. D. Peng. ISBN 978-1466561595. Chapman and Hall/CRC, 2014. URL: http://www.crcpress.com/product/isbn/9781466561595.
[24] Y. Xie, J. Cheng, and X. Tan. DT: A Wrapper of the JavaScript Library ‘DataTables’. R package version 0.19. 2021. URL: https://CRAN.R-project.org/package=DT.
[25] M. van der Loo. “The stringdist package for approximate string matching”. In: The R Journal 6 (1 2014), pp. 111-122. URL: https://CRAN.R-project.org/package=stringdist.