1 Getting Started

1.1 Data

You’ll need BAM file(s) for your wild-type pool, BAM file(s) for your mutant pool, and the reference genome for your species in fasta format. We recommend that each pool contain at least 20 individuals to ensure a good number of recombinations to measure.

1.2 Installing Dependencies

MMAPPR2 depends on two system tools to function: Samtools and VEP. Both must be installed and in the PATH to be found by the appropriate functions.

1.2.1 Installing Samtools

Instructions to install samtools can be found at https://github.com/samtools/samtools and installation instructions are in the INSTALL file included with samtools.

1.2.2 Installing VEP

You’ll need Ensembl VEP, which you can install like this, replacing my_species with your species (e.g., danio_rerio):

git clone https://github.com/Ensembl/ensembl-vep.git
cd ensembl-vep
perl INSTALL.pl -a ac -s {my_species}

This installs the most recent VEP and allows you to create a cache for your desired species, which is what MMAPPR2 expects by default. If you depart from the installation shown here, or if things don’t go smoothly, see Ensembl’s instructions and make sure any differences are accounted for in the VEPFlags object.

Note: If you have any trouble installing VEP, using their Docker image may save you a lot of hassle.

Note: We have found that R sometimes has issues finding VEP, especially when perlbrew is used. If you encounter errors at the path to your perl installation to the .Rprofile file. For example:

Sys.setenv(PATH=paste("/Path/to/Perlbrew", Sys.getenv("PATH"), sep=":"))

2 Basic Use

2.1 Setting Parameters

For our example, we will use just the golden gene from the GRCz11 zebrafish reference genome.

Here we also configure the VEPFlags object to use our example fasta and GTF files. See below for more info.

Make sure your reference genome is the same you’ll use with VEP! This will be the most recent assembly available on Ensembl unless you customize. You should use the same genome in aligning your sequencing data as well.

BiocParallel::register(BiocParallel::MulticoreParam())  ## see below for explanation of BiocParallel
library(MMAPPR2, quietly = TRUE)
library(MMAPPR2data, quietly = TRUE)
library(Rsamtools, quietly = TRUE)
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:MMAPPR2':
## 
##     species, species<-
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     Filter, Find, Map, Position, Reduce, anyDuplicated, append,
##     as.data.frame, basename, cbind, colnames, dirname, do.call,
##     duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
##     lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
##     pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table,
##     tapply, union, unique, unsplit, which.max, which.min
## 
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:base':
## 
##     I, expand.grid, unname
## 
## Attaching package: 'IRanges'
## The following object is masked from 'package:MMAPPR2':
## 
##     distance
## 
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
## 
##     strsplit
# This is normally configured automatically:
vepFlags <- ensemblVEP::VEPFlags(flags = list(
    format = 'vcf',  # <-- this is necessary
    vcf = FALSE,  # <-- as well as this
    species = 'danio_rerio',
    database = FALSE,  # <-- these three arguments allow us to run VEP offline,
    fasta = goldenFasta(),  # <-╯|  which you probably won't need
    gff = goldenGFF(),  # <------╯
    filter_common = TRUE,
    coding_only = TRUE  # assuming RNA-seq data
))

param <- MmapprParam(refFasta = goldenFasta(),
                     wtFiles = exampleWTbam(),
                     mutFiles = exampleMutBam(),
                     species = 'danio_rerio',
                     vepFlags = vepFlags,  ## optional
                     outputFolder = tempOutputFolder())  ## optional
## NOTE: genome 'danio_rerio' already exists, not overwriting

2.2 Running MMAPPR2

With parameters set, running the pipeline should be as simple as the following:

mmapprData <- mmappr(param)
## ------------------------------------
## -------- Welcome to MMAPPR2 --------
## ------------------------------------
## Start time: 2022-04-26 17:11:39
## Output folder: /tmp/RtmpZFm2nq/Rbuild75fb44bb0620/MMAPPR2/vignettes//tmp/RtmpliLD14/mmappr2_2022-04-26_17:11:39
## Reading BAM files and generating Euclidean distance data...
## Generating optimal Loess curves for each chromosome...
## Identifying chromosome(s) harboring linkage region...
## Peak regions succesfully identified
## Refining peak characterization using SNP resampling...
## Generating, analyzing, and ranking candidate variants...
## Warning in valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE): GRanges object contains 1 out-of-bound range located on sequence 18.
##   Note that ranges located on a sequence whose length is unknown (NA) or
##   on a circular sequence are not considered out-of-bound (use
##   seqlengths() and isCircular() to get the lengths and circularity flags
##   of the underlying sequences). You can use trim() to trim these ranges.
##   See ?`trim,GenomicRanges-method` for more information.
## Warning in .Call(.make_vcf_geno, filename, fixed, names(geno), as.list(geno), :
## converting NULL pointer to R NULL
## Writing output plots and tables...
## 
## End time: 2022-04-26 17:11:51
## MMAPPR2 runtime: 11.93783 secs

The MMAPPR2 pipeline can also be run a step at a time:

md <- new('MmapprData', param = param) ## calculateDistance() takes a MmapprData object
postCalcDistMD <- calculateDistance(md)
postLoessMD <- loessFit(postCalcDistMD)
postPrePeakMD <- prePeak(postLoessMD)
postPeakRefMD <- peakRefinement(postPrePeakMD)
postCandidatesMD <- generateCandidates(postPeakRefMD)
## Warning in valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE): GRanges object contains 1 out-of-bound range located on sequence 18.
##   Note that ranges located on a sequence whose length is unknown (NA) or
##   on a circular sequence are not considered out-of-bound (use
##   seqlengths() and isCircular() to get the lengths and circularity flags
##   of the underlying sequences). You can use trim() to trim these ranges.
##   See ?`trim,GenomicRanges-method` for more information.
## Warning in .Call(.make_vcf_geno, filename, fixed, names(geno), as.list(geno), :
## converting NULL pointer to R NULL
outputMmapprData(postCandidatesMD)

If the pipeline fails midway, the MmapprData object is saved, which you can then load and use for inspection and debugging:

## Contents of output folder:
cat(paste(system2('ls', outputFolder(param(mmapprData)), stdout = TRUE)), sep = '\n')
## 18.tsv
## genome_plots.pdf
## mmappr2.log
## mmappr_data.RDS
## peak_plots.pdf
mdFile <- file.path(outputFolder(param(mmapprData)), 'mmappr_data.RDS')
md <- readRDS(mdFile)
md
## MmapprData object with following slots:
## param:
##   MmapprParam object with following values:
##   Reference fasta file:
##      /home/biocbuild/bbs-3.15-bioc/R/library/MMAPPR2data/extdata/slc24a5.fa.gz
##   wtFiles:
##      BamFileList of length 1
##      names(1): wt.bam
##   mutFiles:
##      BamFileList of length 1
##      names(1): mut.bam
##   vepFlags:
##      class: VEPFlags 
##      flags(6): format, species, ..., filter_common, coding_only
##      version: 105
##      scriptPath:  
##   refGenome:
##      GmapGenome object
##      genome: danio_rerio 
##      directory: /home/biocbuild/.local/share/gmap 
##   Other parameters:
##                                       species 
##                                   danio_rerio 
##                                 distancePower 
##                                             4 
##                             peakIntervalWidth 
##                                          0.95 
##                                      minDepth 
##                                            10 
##                              homozygoteCutoff 
##                                          0.95 
##                                minBaseQuality 
##                                            20 
##                                 minMapQuality 
##                                            30 
##                            loessOptResolution 
##                                         0.001 
##                             loessOptCutFactor 
##                                           0.1 
##                                      naCutoff 
##                                             0 
##                                  outputFolder 
##   /tmp/RtmpliLD14/mmappr2_2022-04-26_17:11:39 
##                               fileAggregation 
##                                           sum 
## distance:
##   Contains Euclidian distance data for 1 sequence(s)
##   and Loess regression data for 1 of those
##   Memory usage = 1 MB
## peaks:
##   18: start = -140, end = 14091
##     Density function calculated
## candidates:
##   $`18`
##   GRanges object with 12 ranges and 30 metadata columns:
##                  seqnames    ranges strand |      Allele      Consequence
##                     <Rle> <IRanges>  <Rle> | <character>      <character>
##      18:5744_A/C       18      5744      * |           C missense_variant

2.3 Results

If everything goes well you should be able to track down your mutation using the candidates slot of your MmapprData object or by looking at files in the output folder:

head(candidates(mmapprData)$`18`, n=2)
## GRanges object with 2 ranges and 30 metadata columns:
##               seqnames    ranges strand |      Allele      Consequence
##                  <Rle> <IRanges>  <Rle> | <character>      <character>
##   18:5744_A/C       18      5744      * |           C missense_variant
##   18:5736_T/C       18      5736      * |           C missense_variant
##                    IMPACT             SYMBOL               Gene Feature_type
##               <character>        <character>        <character>  <character>
##   18:5744_A/C    MODERATE ENSDARG00000024771 ENSDARG00000024771   Transcript
##   18:5736_T/C    MODERATE ENSDARG00000024771 ENSDARG00000024771   Transcript
##                          Feature        BIOTYPE        EXON      INTRON
##                      <character>    <character> <character> <character>
##   18:5744_A/C ENSDART00000033574 protein_coding         6/9        <NA>
##   18:5736_T/C ENSDART00000033574 protein_coding         6/9        <NA>
##                     HGVSc       HGVSp cDNA_position CDS_position
##               <character> <character>   <character>  <character>
##   18:5744_A/C        <NA>        <NA>           940          880
##   18:5736_T/C        <NA>        <NA>           932          872
##               Protein_position Amino_acids      Codons Existing_variation
##                    <character> <character> <character>        <character>
##   18:5744_A/C              294         S/R     Agc/Cgc               <NA>
##   18:5736_T/C              291         L/P     cTa/cCa               <NA>
##                  DISTANCE      STRAND       FLAGS SYMBOL_SOURCE     HGNC_ID
##               <character> <character> <character>   <character> <character>
##   18:5744_A/C        <NA>           1        <NA>          <NA>        <NA>
##   18:5736_T/C        <NA>           1        <NA>          <NA>        <NA>
##                       SOURCE       FREQS    CLIN_SIG     SOMATIC       PHENO
##                  <character> <character> <character> <character> <character>
##   18:5744_A/C slc24a5.gff.gz        <NA>        <NA>        <NA>        <NA>
##   18:5736_T/C slc24a5.gff.gz        <NA>        <NA>        <NA>        <NA>
##               slc24a5.gff.gz peakDensity
##                  <character>   <numeric>
##   18:5744_A/C           <NA> 4.69571e-05
##   18:5736_T/C           <NA> 4.63563e-05
##   -------
##   seqinfo: 1 sequence from  genome
outputTsv <- file.path(outputFolder(param(mmapprData)), '18.tsv')
cat(paste(system2('head', outputTsv, stdout = TRUE)), sep = '\n')
## Position Symbol  Impact  Consequence DensityScore    Allele  AminoAcid   Feature
## 5744 ENSDARG00000024771  MODERATE    missense_variant    4.69570920024265e-05    C   S/R ENSDART00000033574
## 5736 ENSDARG00000024771  MODERATE    missense_variant    4.63563337842703e-05    C   L/P ENSDART00000033574
## 5706 ENSDARG00000024771  MODERATE    missense_variant    4.3464007193086e-05 C   I/T ENSDART00000033574
## 4117 ENSDARG00000024771  MODERATE    missense_variant    4.30949569470718e-05    C   V/L ENSDART00000033574
## 5494 ENSDARG00000024771  HIGH    stop_gained 3.26854376246832e-05    G   Y/* ENSDART00000033574
## 5601 ENSDARG00000024771  MODERATE    missense_variant    3.17179280841815e-05    C   C/S ENSDART00000033574
## 7197 ENSDARG00000024771  MODERATE    missense_variant    2.54610321775923e-05    C   M/I ENSDART00000033574
## 7244 ENSDARG00000024771  MODERATE    missense_variant    2.07444414554597e-05    C   V/A ENSDART00000033574
## 7336 ENSDARG00000024771  MODERATE    missense_variant    1.15600976934635e-05    C   I/L ENSDART00000033574

3 Advanced Configuration

3.1 Configure VEPFlags Object

MMAPPR2 uses the ensemblVEP Bioconductor package to predict the effect of variants in the peak region. To customize this process, you’ll need to configure a VEPFlags object. Look at Ensembl’s website for script options. You can configure the VEPFlags object like this:

library(ensemblVEP, quietly = TRUE)
## 
## Attaching package: 'MatrixGenerics'
## The following objects are masked from 'package:matrixStats':
## 
##     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
##     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
##     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
##     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
##     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
##     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
##     colWeightedMeans, colWeightedMedians, colWeightedSds,
##     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
##     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
##     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
##     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
##     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
##     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
##     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
##     rowWeightedSds, rowWeightedVars
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## 
## Attaching package: 'Biobase'
## The following object is masked from 'package:MatrixGenerics':
## 
##     rowMedians
## The following objects are masked from 'package:matrixStats':
## 
##     anyMissing, rowMedians
## 
## Attaching package: 'VariantAnnotation'
## The following object is masked from 'package:base':
## 
##     tabulate
## 
## Attaching package: 'ensemblVEP'
## The following object is masked from 'package:Biobase':
## 
##     cache
## The following object is masked from 'package:BiocStyle':
## 
##     output
vepFlags <- VEPFlags(flags = list(
    ### DEFAULT SETTINGS
    format = 'vcf',  # <-- this is necessary
    vcf = FALSE,  # <-- as well as this
    species = 'danio_rerio',
    database = FALSE,
    cache = TRUE,
    filter_common = TRUE,
    coding_only = TRUE  # assuming RNA-seq data
    ### YOU MAY FIND THESE INTERESTING:
    # everything = TRUE  # enables many optional analyses, such as Polyphen and SIFT
    # per_gene = TRUE  # will output only the most severe variant per gene
    # pick = TRUE  # will output only one consequence per variant
))

3.2 BiocParallel Configuration

MMAPPR2 simply uses the default bpparam registered. You can change this (for example, if BiocParallel isn’t working correctly) with the BiocParallel::register command. For example:

library(BiocParallel, quietly = TRUE)
register(SerialParam())
register(MulticoreParam(progressbar=TRUE))
registered()
## $MulticoreParam
## class: MulticoreParam
##   bpisup: FALSE; bpnworkers: 4; bptasks: 2147483647; bpjobname: BPJOB
##   bplog: FALSE; bpthreshold: INFO; bpstopOnError: TRUE
##   bpRNGseed: ; bptimeout: NA; bpprogressbar: TRUE
##   bpexportglobals: TRUE; bpexportvariables: FALSE; bpforceGC: TRUE; bpfallback: TRUE
##   bplogdir: NA
##   bpresultdir: NA
##   cluster type: FORK
## 
## $SerialParam
## class: SerialParam
##   bpisup: FALSE; bpnworkers: 1; bptasks: 0; bpjobname: BPJOB
##   bplog: FALSE; bpthreshold: INFO; bpstopOnError: TRUE
##   bpRNGseed: ; bptimeout: NA; bpprogressbar: FALSE
##   bpexportglobals: FALSE; bpexportvariables: FALSE; bpforceGC: FALSE; bpfallback: FALSE
##   bplogdir: NA
##   bpresultdir: NA
## 
## $SnowParam
## class: SnowParam
##   bpisup: FALSE; bpnworkers: 4; bptasks: 0; bpjobname: BPJOB
##   bplog: FALSE; bpthreshold: INFO; bpstopOnError: TRUE
##   bpRNGseed: ; bptimeout: NA; bpprogressbar: FALSE
##   bpexportglobals: TRUE; bpexportvariables: TRUE; bpforceGC: FALSE; bpfallback: TRUE
##   bplogdir: NA
##   bpresultdir: NA
##   cluster type: SOCK

The last param registered becomes the default.

3.3 Reference Genome

The variant calling step requires a BiocStyle::Biocpkg("gmapR") GmapGenome, which is normally automatically generated from the refFasta parameter. If for some reason you want to generate your own, the process is like this:

refGenome <- gmapR::GmapGenome(goldenFasta(), name='slc24a5', create=TRUE)

3.4 Whole-genome Sequencing (WGS)

MMAPPR2, like its predecessor, was designed for and tested using RNA-Seq data. However, the principles at work should still apply for WGS data.


4 Session Info

sessionInfo()
## R version 4.2.0 RC (2022-04-19 r82224)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.15-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.15-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] BiocParallel_1.30.0         ensemblVEP_1.38.0          
##  [3] VariantAnnotation_1.42.0    SummarizedExperiment_1.26.0
##  [5] Biobase_2.56.0              MatrixGenerics_1.8.0       
##  [7] matrixStats_0.62.0          Rsamtools_2.12.0           
##  [9] Biostrings_2.64.0           XVector_0.36.0             
## [11] GenomicRanges_1.48.0        GenomeInfoDb_1.32.0        
## [13] IRanges_2.30.0              S4Vectors_0.34.0           
## [15] BiocGenerics_0.42.0         MMAPPR2data_1.9.0          
## [17] MMAPPR2_1.10.0              BiocStyle_2.24.0           
## 
## loaded via a namespace (and not attached):
##  [1] httr_1.4.2               sass_0.4.1               bit64_4.0.5             
##  [4] jsonlite_1.8.0           bslib_0.3.1              assertthat_0.2.1        
##  [7] BiocManager_1.30.17      VariantTools_1.38.0      BiocFileCache_2.4.0     
## [10] blob_1.2.3               BSgenome_1.64.0          GenomeInfoDbData_1.2.8  
## [13] yaml_2.3.5               progress_1.2.2           pillar_1.7.0            
## [16] RSQLite_2.2.12           lattice_0.20-45          glue_1.6.2              
## [19] digest_0.6.29            htmltools_0.5.2          Matrix_1.4-1            
## [22] XML_3.99-0.9             pkgconfig_2.0.3          biomaRt_2.52.0          
## [25] bookdown_0.26            zlibbioc_1.42.0          purrr_0.3.4             
## [28] tibble_3.1.6             KEGGREST_1.36.0          generics_0.1.2          
## [31] ellipsis_0.3.2           cachem_1.0.6             GenomicFeatures_1.48.0  
## [34] gmapR_1.38.0             cli_3.3.0                magrittr_2.0.3          
## [37] crayon_1.5.1             memoise_2.0.1            evaluate_0.15           
## [40] fansi_1.0.3              xml2_1.3.3               tools_4.2.0             
## [43] prettyunits_1.1.1        hms_1.1.1                BiocIO_1.6.0            
## [46] lifecycle_1.0.1          stringr_1.4.0            DelayedArray_0.22.0     
## [49] AnnotationDbi_1.58.0     compiler_4.2.0           jquerylib_0.1.4         
## [52] rlang_1.0.2              grid_4.2.0               RCurl_1.98-1.6          
## [55] rjson_0.2.21             rappdirs_0.3.3           bitops_1.0-7            
## [58] rmarkdown_2.14           restfulr_0.0.13          curl_4.3.2              
## [61] DBI_1.1.2                R6_2.5.1                 GenomicAlignments_1.32.0
## [64] rtracklayer_1.56.0       knitr_1.38               dplyr_1.0.8             
## [67] utf8_1.2.2               fastmap_1.1.0            bit_4.0.4               
## [70] filelock_1.0.2           stringi_1.7.6            parallel_4.2.0          
## [73] Rcpp_1.0.8.3             vctrs_0.4.1              png_0.1-7               
## [76] tidyselect_1.1.2         dbplyr_2.1.1             xfun_0.30