1 VAR-Seq Workflow

This workflow demonstrates how to use various utilities to build and run automated end-to-end analysis workflows for VAR-Seq data. The full workflow is available in HTML, .Rmd, and .R formats.

1.1 Loading package and workflow template

Load the VAR-Seq sample workflow into your current working directory.

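library(systemPipeRdata)
# Generate the VAR-Seq workflow environment in the current working directory
genWorkenvir(workflow = "varseq")
# Switch the R session into the newly created workflow directory
setwd("varseq")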

The working environment created in the previous step contains the following preconfigured directory structure. Users can change this structure as needed, but must then adjust the code in their workflows accordingly.

  • varseq/
    • This is the directory of the R session running the workflow.
    • Run script (*.Rmd) and sample annotation (targets.txt) files are located here.
    • Note, this directory can have any name (e.g. varseq). Changing its name does not require any modifications in the run script(s).
    • Important subdirectories:
      • param/
        • Stores parameter files such as: *.param, *.tmpl and *_run.sh.
      • data/
        • FASTQ samples
        • Reference FASTA file
        • Annotations
        • etc.
      • results/
        • Alignment, variant and peak files (BAM, VCF, BED)
        • Tabular result files
        • Images and plots
        • etc.
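
Before running a workflow, it can help to confirm that the expected subdirectories are in place. The following is a minimal sketch in base R, assuming the default layout described above; check_workenv is a hypothetical helper introduced here for illustration and is not part of systemPipeRdata.

```r
# Report which of the expected subdirectories exist under a workflow directory.
# `check_workenv` is a hypothetical helper, not a systemPipeRdata function.
check_workenv <- function(path = ".", expected = c("param", "data", "results")) {
  present <- dir.exists(file.path(path, expected))
  names(present) <- expected
  present
}

# Demonstration on a throwaway directory that mimics part of the layout above
demo_dir <- file.path(tempdir(), "varseq_demo")
dir.create(file.path(demo_dir, "param"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(demo_dir, "data"), showWarnings = FALSE)

status <- check_workenv(demo_dir)
print(status)  # results/ is reported as FALSE because it was never created
```

Within a real workflow directory, calling check_workenv(".") after setwd("varseq") would report on the actual param/, data/ and results/ directories.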

The following parameter files are included in each workflow template:

  1. targets.txt: initial targets file provided by the user; downstream targets_*.txt files are generated automatically
  2. *.param: defines parameters for input/output file operations, e.g. trim.param, bwa.param, hisat2.param, …
  3. *_run.sh: optional bash script, e.g.: gatk_run.sh
  4. Compute cluster environment (skip on single machine):
    • .batchtools.conf.R: defines the type of scheduler for batchtools. Note that it must point to the template file that matches the cluster in use.
    • *.tmpl: specifies the parameters of the scheduler used by the system, e.g. Torque, SGE, Slurm, etc.
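
As an illustration of the cluster configuration, a .batchtools.conf.R file for a Slurm system might look like the sketch below. makeClusterFunctionsSlurm() is a batchtools function; the template file name is an assumption and should be replaced with the *.tmpl file that matches the scheduler and file layout actually in use.

```r
# .batchtools.conf.R (sketch for a Slurm cluster)
# batchtools reads this file when it sets up a job registry.
# The template name below is an assumption; point it to the *.tmpl file
# that matches your scheduler (e.g. Torque, SGE or Slurm).
cluster.functions <- makeClusterFunctionsSlurm(template = "batchtools.slurm.tmpl")
```

On a single machine this file is not needed, as noted above.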

1.2 Run workflow

Next, run the chosen sample workflow systemPipeVARseq (.Rmd) by executing make -B from the command line within the varseq directory. Alternatively, run the code from the provided *.Rmd template file interactively from within R.
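
A third option is to render the whole template from R in a single call. The sketch below assumes that the template is named systemPipeVARseq.Rmd and that the rmarkdown package is installed; run_workflow_report is a hypothetical wrapper introduced for illustration, and it is not necessarily equivalent to what the provided Makefile does.

```r
# `run_workflow_report` is a hypothetical helper, not part of systemPipeR.
# It renders the workflow template if both the file and rmarkdown are available.
run_workflow_report <- function(rmd = "systemPipeVARseq.Rmd") {
  if (!file.exists(rmd)) {
    message("Template not found: ", rmd)
    return(invisible(NULL))
  }
  if (!requireNamespace("rmarkdown", quietly = TRUE)) {
    message("Package 'rmarkdown' is not installed")
    return(invisible(NULL))
  }
  # Executes the code chunks and writes the report next to the template
  rmarkdown::render(rmd)
}

# Call from within the varseq directory; returns NULL if the template is missing
out <- run_workflow_report("systemPipeVARseq.Rmd")
```

Rendering executes every step of the workflow, so on real data this call can take a long time and requires the external command-line tools to be installed.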

The workflow includes the following steps:

  1. Read preprocessing
    • Quality filtering (trimming)
    • FASTQ quality report
  2. Alignments: gsnap, bwa
  3. Variant calling: VariantTools, GATK, BCFtools
  4. Variant filtering: VariantTools and VariantAnnotation
  5. Variant annotation: VariantAnnotation
  6. Combine results from many samples
  7. Summary statistics of samples

2 Version Information

sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.9-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.9-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices
## [6] utils     datasets  methods   base     
## 
## other attached packages:
##  [1] DESeq2_1.24.0               batchtools_0.9.11          
##  [3] data.table_1.12.2           ape_5.3                    
##  [5] ggplot2_3.2.0               systemPipeR_1.18.2         
##  [7] ShortRead_1.42.0            GenomicAlignments_1.20.1   
##  [9] SummarizedExperiment_1.14.0 DelayedArray_0.10.0        
## [11] matrixStats_0.54.0          Biobase_2.44.0             
## [13] BiocParallel_1.18.0         Rsamtools_2.0.0            
## [15] Biostrings_2.52.0           XVector_0.24.0             
## [17] GenomicRanges_1.36.0        GenomeInfoDb_1.20.0        
## [19] IRanges_2.18.1              S4Vectors_0.22.0           
## [21] BiocGenerics_0.30.0         BiocStyle_2.12.0           
## 
## loaded via a namespace (and not attached):
##  [1] colorspace_1.4-1         rjson_0.2.20            
##  [3] hwriter_1.3.2            htmlTable_1.13.1        
##  [5] base64enc_0.1-3          rstudioapi_0.10         
##  [7] bit64_0.9-7              AnnotationDbi_1.46.0    
##  [9] codetools_0.2-16         splines_3.6.0           
## [11] geneplotter_1.62.0       knitr_1.23              
## [13] Formula_1.2-3            annotate_1.62.0         
## [15] cluster_2.1.0            GO.db_3.8.2             
## [17] pheatmap_1.0.12          graph_1.62.0            
## [19] BiocManager_1.30.4       compiler_3.6.0          
## [21] httr_1.4.0               GOstats_2.50.0          
## [23] backports_1.1.4          assertthat_0.2.1        
## [25] Matrix_1.2-17            lazyeval_0.2.2          
## [27] limma_3.40.2             formatR_1.7             
## [29] acepack_1.4.1            htmltools_0.3.6         
## [31] prettyunits_1.0.2        tools_3.6.0             
## [33] gtable_0.3.0             glue_1.3.1              
## [35] GenomeInfoDbData_1.2.1   Category_2.50.0         
## [37] dplyr_0.8.1              rappdirs_0.3.1          
## [39] Rcpp_1.0.1               nlme_3.1-140            
## [41] rtracklayer_1.44.0       xfun_0.7                
## [43] stringr_1.4.0            XML_3.98-1.20           
## [45] edgeR_3.26.5             zlibbioc_1.30.0         
## [47] scales_1.0.0             BSgenome_1.52.0         
## [49] VariantAnnotation_1.30.1 hms_0.4.2               
## [51] RBGL_1.60.0              RColorBrewer_1.1-2      
## [53] yaml_2.2.0               memoise_1.1.0           
## [55] gridExtra_2.3            biomaRt_2.40.0          
## [57] rpart_4.1-15             latticeExtra_0.6-28     
## [59] stringi_1.4.3            RSQLite_2.1.1           
## [61] genefilter_1.66.0        checkmate_1.9.3         
## [63] GenomicFeatures_1.36.2   rlang_0.3.4             
## [65] pkgconfig_2.0.2          bitops_1.0-6            
## [67] evaluate_0.14            lattice_0.20-38         
## [69] purrr_0.3.2              labeling_0.3            
## [71] htmlwidgets_1.3          bit_1.1-14              
## [73] tidyselect_0.2.5         GSEABase_1.46.0         
## [75] AnnotationForge_1.26.0   magrittr_1.5            
## [77] bookdown_0.11            R6_2.4.0                
## [79] Hmisc_4.2-0              base64url_1.4           
## [81] DBI_1.0.0                pillar_1.4.1            
## [83] foreign_0.8-71           withr_2.1.2             
## [85] survival_2.44-1.1        RCurl_1.95-4.12         
## [87] nnet_7.3-12              tibble_2.1.3            
## [89] crayon_1.3.4             rmarkdown_1.13          
## [91] progress_1.2.2           locfit_1.5-9.1          
## [93] grid_3.6.0               blob_1.1.1              
## [95] Rgraphviz_2.28.0         digest_0.6.19           
## [97] xtable_1.8-4             brew_1.0-6              
## [99] munsell_0.5.0

3 Funding

This project was supported by funds from the National Institutes of Health (NIH) and the National Science Foundation (NSF).