1 Simple, flexible and reusable annotation for metaseqR2 pipeline

When using the first version of metaseqR, one had to either embed the annotation to the gene/exon/3’ UTR counts, or to download and construct on-the-fly the required annotation when starting from BAM files. Although the counting and gene model construction results could (anf still can) be saved and re-used with other analysis parameters changed (e.g. statistical algorithms), one could not easily add for example new data to an existing dataset without re-running the whole pipeline and re-downloading annotation. On top of that, many times, the main Ensembl servers (when using Ensembl annotations) do not respond well to biomaRt calls, so the whole pipeline may stall until the servers are back.

Another main issue with the annotation used by metaseqR was that there was no straightforward way, provided by metaseqR, to archive and version the annotation used by a specific analysis and was up to the user to take care of reproducibility at this level. Furthermore, there was no straightforward way for a user to plugin own annotation elements (e.g. in the form of a GTF file) and use it in the same manner as standard annotations supported by metaseqR, e.g. when analyzing data from not-so-often studied organisms such as insects. Plugging-in own annotation was possible but usually a painful procedure, which has become now very easy.

The annotation database builder for metaseqR2 remedies the above situations. The buildAnnotationDatabase function should be run once with the organisms one requires to have locally to work with and then that’s it! Of course you can manage your database by adding and removing specific annotations (and you even can play with an SQLite browser, although not advised, as the database structure is rather simple). Furthermore, you can use the metaseqR2 annotation database and management mechanism for any other type of analysis where you require to have a simple tab-delimited annotation file, acquired with very little effort.

2 Supported organisms

The following organisms (essentially genome versions) are supported for automatic database builds:

Human (Homo sapiens) genome version hg38
Human (Homo sapiens) genome version hg19
Human (Homo sapiens) genome version hg18
Mouse (Mus musculus) genome version mm10
Mouse (Mus musculus) genome version mm9
Rat (Rattus norvegicus) genome version rn6
Rat (Rattus norvegicus) genome version rn5
Fruitfly (Drosophila melanogaster) genome version dm6
Fruitfly (Drosophila melanogaster) genome version dm3
Zebrafish (Danio rerio) genome version danRer7
Zebrafish (Danio rerio) genome version danRer10
Zebrafish (Danio rerio) genome version danRer11
Chimpanzee (Pan troglodytes) genome version panTro4
Chimpanzee (Pan troglodytes) genome version panTro5
Pig (Sus scrofa) genome version susScr3
Pig (Sus scrofa) genome version susScr11
Horse (Equus cabalus) genome version equCab2
Arabidopsis (Arabidobsis thaliana) genome version TAIR10

3 Using the local database

3.1 Installation of metaseqR2

To install the metaseqR2 package, start R and enter:

if(!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("metaseqR2")

3.2 Setup the database

By default, the database file will be written in the system.file(package="metaseqR2") directory. You can specify another prefered destination for it using the db argument in the function call, but if you do that, you will have to supply the localDb argument pointing to the SQLite database file you created to every metaseqr2 call you perform, otherwise, the pipeline will download and use annotations on-the-fly.

In this vignette, we will build a minimal database comprising only the mouse mm10 genome version from Ensembl. The database will be build in a temporary directory inside session tempdir().

Important note: As the annotation build function makes use of Kent utilities for creating 3’UTR annotations from RefSeq and UCSC, the latter cannot be built in Windows. Therefore it is advised to either build the annotation database in a Linux system or use our pre-built databases.

library(metaseqR2)

buildDir <- file.path(tempdir(),"test_anndb")
dir.create(buildDir)

# The location of the custom database
myDb <- file.path(buildDir,"testann.sqlite")

# Since we are using Ensembl, we can also ask for a version
organisms <- list(mm10=100)
sources <- ifelse(.Platform$OS.type=="unix",c("ensembl","refseq"),"ensembl")

# If the example is not running in a multicore system, rc is ignored
buildAnnotationDatabase(organisms,sources,forceDownload=FALSE,db=myDb,rc=0.5)

## Opening metaseqR2 SQLite database /tmp/Rtmpwf6R1U/test_anndb/testann.sqlite

## Retrieving genome information for mm10 from ensembl

## Retrieving gene annotation for mm10 from ensembl version 100

## Using Ensembl host https://apr2020.archive.ensembl.org

## Retrieving transcript annotation for mm10 from ensembl version 100

## Using Ensembl host https://apr2020.archive.ensembl.org

## Merging transcripts for mm10 from ensembl version 100

## Retrieving 3' UTR annotation for mm10 from ensembl version 100

## Using Ensembl host https://apr2020.archive.ensembl.org

## Merging gene 3' UTRs for mm10 from ensembl version 100

## Merging transcript 3' UTRs for mm10 from ensembl version 100

## Retrieving exon annotation for mm10 from ensembl version 100

## Using Ensembl host https://apr2020.archive.ensembl.org

## Retrieving extended exon annotation for mm10 from ensembl version 100

## Using Ensembl host https://apr2020.archive.ensembl.org

## Merging exons for mm10 from ensembl version 100
## Merging exons for mm10 from ensembl version 100

3.3 Use the database

Now, that a small database is in place, let’s retrieve some data. Remember that since the built database is not in the default location, we need to pass the database file in each data retrieval function. The annotation is retrieved as a GRanges object by default.

# Load standard annotation based on gene body coordinates
genes <- loadAnnotation(genome="mm10",refdb="ensembl",level="gene",type="gene",
    db=myDb)
genes

## GRanges object with 55364 ranges and 4 metadata columns:
##                      seqnames            ranges strand |            gene_id
##                         <Rle>         <IRanges>  <Rle> |        <character>
##   ENSMUSG00000102693     chr1   3073253-3074322      + | ENSMUSG00000102693
##   ENSMUSG00000064842     chr1   3102016-3102125      + | ENSMUSG00000064842
##   ENSMUSG00000051951     chr1   3205901-3671498      - | ENSMUSG00000051951
##   ENSMUSG00000102851     chr1   3252757-3253236      + | ENSMUSG00000102851
##   ENSMUSG00000103377     chr1   3365731-3368549      - | ENSMUSG00000103377
##                  ...      ...               ...    ... .                ...
##   ENSMUSG00000095366     chrY 90752427-90755467      - | ENSMUSG00000095366
##   ENSMUSG00000095134     chrY 90753057-90763485      + | ENSMUSG00000095134
##   ENSMUSG00000096768     chrY 90784738-90816465      + | ENSMUSG00000096768
##   ENSMUSG00000099871     chrY 90837413-90844040      + | ENSMUSG00000099871
##   ENSMUSG00000096850     chrY 90838869-90839177      - | ENSMUSG00000096850
##                      gc_content     gene_name                biotype
##                       <numeric>   <character>            <character>
##   ENSMUSG00000102693      34.21 4933401J01Rik                    TEC
##   ENSMUSG00000064842      36.36       Gm26206                  snRNA
##   ENSMUSG00000051951      38.51          Xkr4         protein_coding
##   ENSMUSG00000102851      39.79       Gm18956   processed_pseudogene
##   ENSMUSG00000103377      40.79       Gm37180                    TEC
##                  ...        ...           ...                    ...
##   ENSMUSG00000095366      41.37       Gm21860                lincRNA
##   ENSMUSG00000095134      46.85      Mid1-ps1 unprocessed_pseudogene
##   ENSMUSG00000096768      46.16       Gm47283                lincRNA
##   ENSMUSG00000099871      43.39       Gm21742 unprocessed_pseudogene
##   ENSMUSG00000096850      48.87       Gm21748         protein_coding
##   -------
##   seqinfo: 21 sequences from mm10 genome

# Load standard annotation based on 3' UTR coordinates
utrs <- loadAnnotation(genome="mm10",refdb="ensembl",level="gene",type="utr",
    db=myDb)
utrs

## GRanges object with 228087 ranges and 4 metadata columns:
##                      seqnames            ranges strand |      transcript_id
##                         <Rle>         <IRanges>  <Rle> |        <character>
##   ENSMUST00000193812     chr1   3074323-3074571      + | ENSMUST00000193812
##   ENSMUST00000082908     chr1   3102126-3102374      + | ENSMUST00000082908
##   ENSMUST00000162897     chr1   3205652-3205900      - | ENSMUST00000162897
##   ENSMUST00000159265     chr1   3206274-3206522      - | ENSMUST00000159265
##   ENSMUST00000070533     chr1   3214233-3214481      - | ENSMUST00000070533
##                  ...      ...               ...    ... .                ...
##   ENSMUST00000177591     chrY 90816465-90816713      + | ENSMUST00000177591
##   ENSMUST00000179077     chrY 90816465-90816713      + | ENSMUST00000179077
##   ENSMUST00000238471     chrY 90816466-90816714      + | ENSMUST00000238471
##   ENSMUST00000179623     chrY 90838620-90838868      - | ENSMUST00000179623
##   ENSMUST00000189352     chrY 90844041-90844289      + | ENSMUST00000189352
##                                 gene_id     gene_name                biotype
##                             <character>   <character>            <character>
##   ENSMUST00000193812 ENSMUSG00000102693 4933401J01Rik                    TEC
##   ENSMUST00000082908 ENSMUSG00000064842       Gm26206                  snRNA
##   ENSMUST00000162897 ENSMUSG00000051951          Xkr4         protein_coding
##   ENSMUST00000159265 ENSMUSG00000051951          Xkr4         protein_coding
##   ENSMUST00000070533 ENSMUSG00000051951          Xkr4         protein_coding
##                  ...                ...           ...                    ...
##   ENSMUST00000177591 ENSMUSG00000096768       Gm47283                lincRNA
##   ENSMUST00000179077 ENSMUSG00000096768       Gm47283                lincRNA
##   ENSMUST00000238471 ENSMUSG00000096768       Gm47283                lincRNA
##   ENSMUST00000179623 ENSMUSG00000096850       Gm21748         protein_coding
##   ENSMUST00000189352 ENSMUSG00000099871       Gm21742 unprocessed_pseudogene
##   -------
##   seqinfo: 21 sequences from mm10 genome

# Load summarized exon annotation based used with RNA-Seq analysis
sumEx <- loadAnnotation(genome="mm10",refdb="ensembl",level="gene",type="exon",
    summarized=TRUE,db=myDb)
sumEx

## GRanges object with 291497 ranges and 4 metadata columns:
##                            seqnames            ranges strand |
##                               <Rle>         <IRanges>  <Rle> |
##   ENSMUSG00000102693_MEX_1     chr1   3073253-3074322      + |
##   ENSMUSG00000064842_MEX_1     chr1   3102016-3102125      + |
##   ENSMUSG00000051951_MEX_1     chr1   3205901-3207317      - |
##   ENSMUSG00000051951_MEX_2     chr1   3213439-3216968      - |
##   ENSMUSG00000102851_MEX_1     chr1   3252757-3253236      + |
##                        ...      ...               ...    ... .
##   ENSMUSG00000099871_MEX_1     chrY 90837413-90837520      + |
##   ENSMUSG00000096850_MEX_1     chrY 90838869-90839177      - |
##   ENSMUSG00000099871_MEX_2     chrY 90841657-90841805      + |
##   ENSMUSG00000099871_MEX_3     chrY 90842898-90843025      + |
##   ENSMUSG00000099871_MEX_4     chrY 90843878-90844040      + |
##                                           exon_id            gene_id
##                                       <character>        <character>
##   ENSMUSG00000102693_MEX_1 ENSMUSG00000102693_M.. ENSMUSG00000102693
##   ENSMUSG00000064842_MEX_1 ENSMUSG00000064842_M.. ENSMUSG00000064842
##   ENSMUSG00000051951_MEX_1 ENSMUSG00000051951_M.. ENSMUSG00000051951
##   ENSMUSG00000051951_MEX_2 ENSMUSG00000051951_M.. ENSMUSG00000051951
##   ENSMUSG00000102851_MEX_1 ENSMUSG00000102851_M.. ENSMUSG00000102851
##                        ...                    ...                ...
##   ENSMUSG00000099871_MEX_1 ENSMUSG00000099871_M.. ENSMUSG00000099871
##   ENSMUSG00000096850_MEX_1 ENSMUSG00000096850_M.. ENSMUSG00000096850
##   ENSMUSG00000099871_MEX_2 ENSMUSG00000099871_M.. ENSMUSG00000099871
##   ENSMUSG00000099871_MEX_3 ENSMUSG00000099871_M.. ENSMUSG00000099871
##   ENSMUSG00000099871_MEX_4 ENSMUSG00000099871_M.. ENSMUSG00000099871
##                                gene_name                biotype
##                              <character>            <character>
##   ENSMUSG00000102693_MEX_1 4933401J01Rik                    TEC
##   ENSMUSG00000064842_MEX_1       Gm26206                  snRNA
##   ENSMUSG00000051951_MEX_1          Xkr4         protein_coding
##   ENSMUSG00000051951_MEX_2          Xkr4         protein_coding
##   ENSMUSG00000102851_MEX_1       Gm18956   processed_pseudogene
##                        ...           ...                    ...
##   ENSMUSG00000099871_MEX_1       Gm21742 unprocessed_pseudogene
##   ENSMUSG00000096850_MEX_1       Gm21748         protein_coding
##   ENSMUSG00000099871_MEX_2       Gm21742 unprocessed_pseudogene
##   ENSMUSG00000099871_MEX_3       Gm21742 unprocessed_pseudogene
##   ENSMUSG00000099871_MEX_4       Gm21742 unprocessed_pseudogene
##   -------
##   seqinfo: 21 sequences from mm10 genome

# Load standard annotation based on gene body coordinates from RefSeq
if (.Platform$OS.type=="unix") {
    refGenes <- loadAnnotation(genome="mm10",refdb="refseq",level="gene",
        type="gene",db=myDb)
    refGenes
}

## Getting latest annotation on the fly for mm10 from refseq

## Retrieving genome information for mm10 from refseq

## Retrieving latest gene annotation for mm10 from refseq

## Loading required namespace: RMySQL

## Converting annotation to GenomicRanges object...

## 
## Attaching package: 'Biostrings'

## The following object is masked from 'package:base':
## 
##     strsplit

## 
## Attaching package: 'rtracklayer'

## The following object is masked from 'package:BiocIO':
## 
##     FileForFormat

## Getting DNA sequences...

## Getting GC content...

## GRanges object with 23471 ranges and 4 metadata columns:
##                seqnames            ranges strand |      gene_id gc_content
##                   <Rle>         <IRanges>  <Rle> |  <character>  <numeric>
##   NM_001011874     chr1   3214481-3671498      - | NM_001011874      38.56
##   NM_001370921     chr1   4119865-4360303      - | NM_001370921      38.35
##      NM_011441     chr1   4490927-4497354      - |    NM_011441      49.75
##   NM_001177658     chr1   4773199-4785726      - | NM_001177658      42.59
##      NM_008866     chr1   4807822-4846735      + |    NM_008866      40.99
##            ...      ...               ...    ... .          ...        ...
##   NM_001160135     chrY 67339048-67340047      - | NM_001160135      43.50
##   NM_001037748     chrY 72554831-72581058      + | NM_001037748      35.58
##   NM_001160144     chrY 77073913-77076246      - | NM_001160144      39.85
##      NR_137283     chrY 90751139-90755050      - |    NR_137283      41.51
##      NR_137282     chrY 90785441-90816465      + |    NR_137282      46.22
##                    gene_name     biotype
##                  <character> <character>
##   NM_001011874          Xkr4          NA
##   NM_001370921           Rp1          NA
##      NM_011441         Sox17          NA
##   NM_001177658        Mrpl15          NA
##      NM_008866        Lypla1          NA
##            ...           ...         ...
##   NM_001160135       Gm20806          NA
##   NM_001037748       Gm20736          NA
##   NM_001160144       Gm20816          NA
##      NR_137283 G530011O06Rik          NA
##      NR_137282         Erdr1          NA
##   -------
##   seqinfo: 21 sequences from mm10 genome

Or as a data frame if you prefer using asdf=TRUE. The data frame however does not contain metadata like Seqinfo to be used for any susequent validations:

# Load standard annotation based on gene body coordinates
genes <- loadAnnotation(genome="mm10",refdb="ensembl",level="gene",type="gene",
    db=myDb,asdf=TRUE)
head(genes)

##   chromosome   start     end            gene_id gc_content strand     gene_name
## 1       chr1 3073253 3074322 ENSMUSG00000102693      34.21      + 4933401J01Rik
## 2       chr1 3102016 3102125 ENSMUSG00000064842      36.36      +       Gm26206
## 3       chr1 3205901 3671498 ENSMUSG00000051951      38.51      -          Xkr4
## 4       chr1 3252757 3253236 ENSMUSG00000102851      39.79      +       Gm18956
## 5       chr1 3365731 3368549 ENSMUSG00000103377      40.79      -       Gm37180
## 6       chr1 3375556 3377788 ENSMUSG00000104017      36.99      -       Gm37363
##                biotype
## 1                  TEC
## 2                snRNA
## 3       protein_coding
## 4 processed_pseudogene
## 5                  TEC
## 6                  TEC

3.4 Add a custom annotation

Apart from the supported organisms and databases, you can add a custom annotation. Such an annotation can be:

A non-supported organism (e.g. an insect or another mammal e.g. dog)
A modification or further curation you have done to existing/supported annotations
A supported organism but from a different source
Any other case where the provided annotations are not adequate

This can be achieved through the usage of GTF files, along with some simple metadata that you have to provide for proper import to the annotation database. This can be achieved through the usage of the buildCustomAnnotation function. Details on required metadata can be found in the function’s help page.

Important note: Please note that importing a custom genome annotation directly from UCSC (UCSC SQL database dumps) is not supported in Windows as the process involves using the genePredToGtf which is not available for Windows.

Let’s try a couple of exammples. The first one is a custom annotation for the Ebola virus from UCSC:

# Setup a temporary directory to download files etc.
customDir <- file.path(tempdir(),"test_custom")
dir.create(customDir)

# Convert from GenePred to GTF - Unix/Linux only!
if (.Platform$OS.type == "unix" && !grepl("^darwin",R.version$os)) {
    # Download data from UCSC
    goldenPath="http://hgdownload.cse.ucsc.edu/goldenPath/"
    # Gene annotation dump
    download.file(paste0(goldenPath,"eboVir3/database/ncbiGene.txt.gz"),
        file.path(customDir,"eboVir3_ncbiGene.txt.gz"))
    # Chromosome information
    download.file(paste0(goldenPath,"eboVir3/database/chromInfo.txt.gz"),
        file.path(customDir,"eboVir3_chromInfo.txt.gz"))

    # Prepare the build
    chromInfo <- read.delim(file.path(customDir,"eboVir3_chromInfo.txt.gz"),
        header=FALSE)
    chromInfo <- chromInfo[,1:2]
    rownames(chromInfo) <- as.character(chromInfo[,1])
    chromInfo <- chromInfo[,2,drop=FALSE]
    
    # Coversion from genePred to GTF
    genePredToGtfEnv <- Sys.getenv("GENEPREDTOGTF_BINARY")
    if (genePredToGtfEnv == "") {
        genePredToGtf <- file.path(customDir,"genePredToGtf")
    } else {
        genePredToGtf <- file.path(genePredToGtfEnv)
    }
    if (!file.exists(genePredToGtf)) {
        download.file(
        "http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/genePredToGtf",
            genePredToGtf
        )
        system(paste("chmod 775",genePredToGtf))
    }
    gtfFile <- file.path(customDir,"eboVir3.gtf")
    tmpName <- file.path(customDir,paste(format(Sys.time(),"%Y%m%d%H%M%S"),
        "tgtf",sep="."))
    command <- paste0(
        "zcat ",file.path(customDir,"eboVir3_ncbiGene.txt.gz"),
        " | ","cut -f2- | ",genePredToGtf," file stdin ",tmpName,
        " -source=eboVir3"," -utr && grep -vP '\t\\.\t\\.\t' ",tmpName," > ",
        gtfFile
    )
    system(command)

    # Build with the metadata list filled (you can also provide a version)
    buildCustomAnnotation(
        gtfFile=gtfFile,
        metadata=list(
            organism="eboVir3_test",
            source="ucsc_test",
            chromInfo=chromInfo
        ),
        db=myDb
    )

    # Try to retrieve some data
    eboGenes <- loadAnnotation(genome="eboVir3_test",refdb="ucsc_test",
        level="gene",type="gene",db=myDb)
    eboGenes
}

## Opening metaseqR2 SQLite database /tmp/Rtmpwf6R1U/test_anndb/testann.sqlite

##   Importing GTF /tmp/Rtmpwf6R1U/test_custom/eboVir3.gtf as GTF to make id map

##   Making id map

##   Importing GTF /tmp/Rtmpwf6R1U/test_custom/eboVir3.gtf as TxDb

## Import genomic features from the file as a GRanges object ... OK
## Prepare the 'metadata' data frame ... OK
## Make the TxDb object ... OK
## Retrieving gene annotation for ebovir3_test from ucsc_test version 20241029 from /tmp/Rtmpwf6R1U/test_custom/eboVir3.gtf
## Retrieving transcript annotation for ebovir3_test from ucsc_test version 20241029
## Retrieving summarized transcript annotation for ebovir3_test from ucsc_test version 20241029
## Retrieving 3' UTR annotation for ebovir3_test from ucsc_test version 20241029
## Retrieving summarized 3' UTR annotation per gene for ebovir3_test from ucsc_test version 20241029
##   summarizing UTRs per gene for imported GTF
## Retrieving summarized 3' UTR annotation per transcript for ebovir3_test from ucsc_test version 20241029
##   summarizing UTRs per gene for imported GTF
## Retrieving exon annotation for ebovir3_test from ucsc_test version 20241029
## Retrieving summarized exon annotation for ebovir3_test from ucsc_test version 20241029
##   summarizing exons per gene for imported GTF

## GRanges object with 9 ranges and 4 metadata columns:
##          seqnames      ranges strand |     gene_id gc_content   gene_name
##             <Rle>   <IRanges>  <Rle> | <character>  <numeric> <character>
##     NP KM034562v1     56-3026      + |          NP         50          NP
##   VP35 KM034562v1   3032-4407      + |        VP35         50        VP35
##   VP40 KM034562v1   4390-5894      + |        VP40         50        VP40
##     GP KM034562v1   5900-8305      + |          GP         50          GP
##    sGP KM034562v1   5900-8305      + |         sGP         50         sGP
##   ssGP KM034562v1   5900-8305      + |        ssGP         50        ssGP
##   VP30 KM034562v1   8288-9740      + |        VP30         50        VP30
##   VP24 KM034562v1  9885-11518      + |        VP24         50        VP24
##      L KM034562v1 11501-18282      + |           L         50           L
##            biotype
##        <character>
##     NP        gene
##   VP35        gene
##   VP40        gene
##     GP        gene
##    sGP        gene
##   ssGP        gene
##   VP30        gene
##   VP24        gene
##      L        gene
##   -------
##   seqinfo: 1 sequence from ebovir3_test genome

Another example, the Atlantic cod from UCSC. The same things apply for the operating system.

if (.Platform$OS.type == "unix") {
    # Gene annotation dump
    download.file(paste0(goldenPath,"gadMor1/database/augustusGene.txt.gz"),
        file.path(customDir,"gadMori1_augustusGene.txt.gz"))
    # Chromosome information
    download.file(paste(goldenPath,"gadMor1/database/chromInfo.txt.gz",sep=""),
        file.path(customDir,"gadMori1_chromInfo.txt.gz"))

    # Prepare the build
    chromInfo <- read.delim(file.path(customDir,"gadMori1_chromInfo.txt.gz"),
        header=FALSE)
    chromInfo <- chromInfo[,1:2]
    rownames(chromInfo) <- as.character(chromInfo[,1])
    chromInfo <- chromInfo[,2,drop=FALSE]
    
    # Coversion from genePred to GTF
    genePredToGtf <- file.path(customDir,"genePredToGtf")
    if (!file.exists(genePredToGtf)) {
        download.file(
        "http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/genePredToGtf",
            genePredToGtf
        )
        system(paste("chmod 775",genePredToGtf))
    }
    gtfFile <- file.path(customDir,"gadMori1.gtf")
    tmpName <- file.path(customDir,paste(format(Sys.time(),"%Y%m%d%H%M%S"),
        "tgtf",sep="."))
    command <- paste0(
        "zcat ",file.path(customDir,"gadMori1_augustusGene.txt.gz"),
        " | ","cut -f2- | ",genePredToGtf," file stdin ",tmpName,
        " -source=gadMori1"," -utr && grep -vP '\t\\.\t\\.\t' ",tmpName," > ",
        gtfFile
    )
    system(command)

    # Build with the metadata list filled (you can also provide a version)
    buildCustomAnnotation(
        gtfFile=gtfFile,
        metadata=list(
            organism="gadMor1_test",
            source="ucsc_test",
            chromInfo=chromInfo
        ),
        db=myDb
    )

    # Try to retrieve some data
    gadGenes <- loadAnnotation(genome="gadMor1_test",refdb="ucsc_test",
        level="gene",type="gene",db=myDb)
    gadGenes
}

Another example, Armadillo from Ensembl. This should work irrespectively of operating system. We are downloading chromosomal information from UCSC.

# Gene annotation dump from Ensembl
download.file(paste0("ftp://ftp.ensembl.org/pub/release-98/gtf/",
    "dasypus_novemcinctus/Dasypus_novemcinctus.Dasnov3.0.98.gtf.gz"),
    file.path(customDir,"Dasypus_novemcinctus.Dasnov3.0.98.gtf.gz"))

# Chromosome information will be provided from the following BAM file
# available from Ensembl. We have noticed that when using Windows as the OS,
# a remote BAM files cannot be opened by scanBamParam, so for this example,
# chromosome length information will not be available when running in Windows.
bamForInfo <- NULL
if (.Platform$OS.type == "unix")
    bamForInfo <- paste0("ftp://ftp.ensembl.org/pub/release-98/bamcov/",
        "dasypus_novemcinctus/genebuild/Dasnov3.broad.Ascending_Colon_5.1.bam")

# Build with the metadata list filled (you can also provide a version)
buildCustomAnnotation(
    gtfFile=file.path(customDir,"Dasypus_novemcinctus.Dasnov3.0.98.gtf.gz"),
    metadata=list(
        organism="dasNov3_test",
        source="ensembl_test",
        chromInfo=bamForInfo
    ),
    db=myDb
)

# Try to retrieve some data
dasGenes <- loadAnnotation(genome="dasNov3_test",refdb="ensembl_test",
    level="gene",type="gene",db=myDb)
dasGenes

3.5 A complete build

A quite complete build (with latest versions of Ensembl annotations) would look like (supposing the default annotation database location):

organisms <- list(
    hg18=54,
    hg19=75,
    hg38=110:111,
    mm9=54,
    mm10=110:111,
    rn5=77,
    rn6=110:111,
    dm3=77,
    dm6=110:111,
    danrer7=77,
    danrer10=80,
    danrer11=110:111,
    pantro4=80,
    pantro5=110:111,
    susscr3=80,
    susscr11=110:111,
    equcab2=110:111
)

sources <- c("ensembl","ucsc","refseq")

buildAnnotationDatabase(organisms,sources,forceDownload=FALSE,rc=0.5)

The aforementioned complete built can be found here Complete builts will become available from time to time (e.g. with every new Ensembl relrase) for users who do not wish to create annotation databases on their own. Root access may be required (depending on the metaseqR2 library location) to place it in the default location where it can be found automatically.

4 Annotations on-the-fly

If for some reason you do not want to build and use an annotation database for metaseqR2 analyses (not recommended) or you wish to perform an analysis with an organism that does not yet exist in the database, the loadAnnotation function will perform all required actions (download and create a GRanges object) on-the-fly as long as there is an internet connection.

However, the above function does not handle custom annotations in GTF files. In a scenario where you want to use a custom annotation only once, you should supply the annotation argument to the metaseqr2 function, which is almost the same as the metadata argument used in buildCustomAnnotation, actually augmented by a list member for the GTF file, that is:

annotation <- list(
    gtf="PATH_TO_GTF",
    organism="ORGANISM_NAME",
    source="SOURCE_NAME",
    chromInfo="CHROM_INFO"
)

The above argument can be passed to the metaseqr2 call in the respective position.

For further details about custom annotations on the fly, please check buildCustomAnnotation and importCustomAnnotation functions.

5 Session Info

sessionInfo()

## R Under development (unstable) (2024-10-21 r87258)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] splines   stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] BSgenome.Mmusculus.UCSC.mm10_1.4.3 BSgenome_1.75.0                   
##  [3] rtracklayer_1.67.0                 BiocIO_1.17.0                     
##  [5] Biostrings_2.75.0                  XVector_0.47.0                    
##  [7] metaseqR2_1.19.0                   locfit_1.5-9.10                   
##  [9] limma_3.63.0                       DESeq2_1.47.0                     
## [11] SummarizedExperiment_1.37.0        Biobase_2.67.0                    
## [13] MatrixGenerics_1.19.0              matrixStats_1.4.1                 
## [15] GenomicRanges_1.59.0               GenomeInfoDb_1.43.0               
## [17] IRanges_2.41.0                     S4Vectors_0.45.0                  
## [19] BiocGenerics_0.53.0                BiocStyle_2.35.0                  
## 
## loaded via a namespace (and not attached):
##   [1] survcomp_1.57.0           bitops_1.0-9             
##   [3] filelock_1.0.3            tibble_3.2.1             
##   [5] R.oo_1.26.0               preprocessCore_1.69.0    
##   [7] XML_3.99-0.17             lifecycle_1.0.4          
##   [9] httr2_1.0.5               pwalign_1.3.0            
##  [11] edgeR_4.5.0               globals_0.16.3           
##  [13] MASS_7.3-61               lattice_0.22-6           
##  [15] dendextend_1.18.1         magrittr_2.0.3           
##  [17] plotly_4.10.4             sass_0.4.9               
##  [19] rmarkdown_2.28            jquerylib_0.1.4          
##  [21] yaml_2.3.10               DBI_1.2.3                
##  [23] RColorBrewer_1.1-3        ABSSeq_1.61.0            
##  [25] harmonicmeanp_3.0.1       abind_1.4-8              
##  [27] ShortRead_1.65.0          zlibbioc_1.53.0          
##  [29] purrr_1.0.2               R.utils_2.12.3           
##  [31] rmeta_3.0                 RCurl_1.98-1.16          
##  [33] rappdirs_0.3.3            lava_1.8.0               
##  [35] seriation_1.5.6           GenomeInfoDbData_1.2.13  
##  [37] survivalROC_1.0.3.1       listenv_0.9.1            
##  [39] genefilter_1.89.0         parallelly_1.38.0        
##  [41] annotate_1.85.0           permute_0.9-7            
##  [43] DelayedMatrixStats_1.29.0 codetools_0.2-20         
##  [45] DelayedArray_0.33.0       DT_0.33                  
##  [47] xml2_1.3.6                SuppDists_1.1-9.8        
##  [49] tidyselect_1.2.1          futile.logger_1.4.3      
##  [51] UCSC.utils_1.3.0          viridis_0.6.5            
##  [53] TSP_1.2-4                 rmdformats_1.0.4         
##  [55] BiocFileCache_2.15.0      webshot_0.5.5            
##  [57] GenomicAlignments_1.43.0  jsonlite_1.8.9           
##  [59] iterators_1.0.14          survival_3.7-0           
##  [61] foreach_1.5.2             tools_4.5.0              
##  [63] progress_1.2.3            Rcpp_1.0.13              
##  [65] glue_1.8.0                prodlim_2024.06.25       
##  [67] gridExtra_2.3             SparseArray_1.7.0        
##  [69] xfun_0.48                 qvalue_2.39.0            
##  [71] ca_0.71.1                 dplyr_1.1.4              
##  [73] HDF5Array_1.35.0          withr_3.0.2              
##  [75] NBPSeq_0.3.1              formatR_1.14             
##  [77] BiocManager_1.30.25       fastmap_1.2.0            
##  [79] latticeExtra_0.6-30       rhdf5filters_1.19.0      
##  [81] fansi_1.0.6               caTools_1.18.3           
##  [83] digest_0.6.37             R6_2.5.1                 
##  [85] colorspace_2.1-1          RMySQL_0.10.29           
##  [87] gtools_3.9.5              jpeg_0.1-10              
##  [89] biomaRt_2.63.0            RSQLite_2.3.7            
##  [91] R.methodsS3_1.8.2         tidyr_1.3.1              
##  [93] utf8_1.2.4                generics_0.1.3           
##  [95] data.table_1.16.2         prettyunits_1.2.0        
##  [97] httr_1.4.7                htmlwidgets_1.6.4        
##  [99] S4Arrays_1.7.0            pkgconfig_2.0.3          
## [101] gtable_0.3.6              registry_0.5-1           
## [103] blob_1.2.4                hwriter_1.3.2.1          
## [105] htmltools_0.5.8.1         bookdown_0.41            
## [107] log4r_0.4.4               scales_1.3.0             
## [109] png_0.1-8                 corrplot_0.95            
## [111] knitr_1.48                lambda.r_1.2.4           
## [113] reshape2_1.4.4            rjson_0.2.23             
## [115] curl_5.2.3                zoo_1.8-12               
## [117] cachem_1.1.0              rhdf5_2.51.0             
## [119] stringr_1.5.1             KernSmooth_2.23-24       
## [121] parallel_4.5.0            AnnotationDbi_1.69.0     
## [123] vsn_3.75.0                restfulr_0.0.15          
## [125] pillar_1.9.0              grid_4.5.0               
## [127] vctrs_0.6.5               gplots_3.2.0             
## [129] dbplyr_2.5.0              xtable_1.8-4             
## [131] evaluate_1.0.1            bsseq_1.43.0             
## [133] VennDiagram_1.7.3         GenomicFeatures_1.59.0   
## [135] cli_3.6.3                 compiler_4.5.0           
## [137] futile.options_1.0.1      Rsamtools_2.23.0         
## [139] rlang_1.1.4               crayon_1.5.3             
## [141] FMStable_0.1-4            future.apply_1.11.3      
## [143] heatmaply_1.5.0           interp_1.1-6             
## [145] aroma.light_3.37.0        affy_1.85.0              
## [147] plyr_1.8.9                pander_0.6.5             
## [149] stringi_1.8.4             viridisLite_0.4.2        
## [151] deldir_2.0-4              BiocParallel_1.41.0      
## [153] assertthat_0.2.1          txdbmaker_1.3.0          
## [155] munsell_0.5.1             lazyeval_0.2.2           
## [157] DSS_2.55.0                EDASeq_2.41.0            
## [159] Matrix_1.7-1              hms_1.1.3                
## [161] future_1.34.0             sparseMatrixStats_1.19.0 
## [163] bit64_4.5.2               ggplot2_3.5.1            
## [165] Rhdf5lib_1.29.0           KEGGREST_1.47.0          
## [167] statmod_1.5.0             memoise_2.0.1            
## [169] affyio_1.77.0             bslib_0.8.0              
## [171] bootstrap_2019.6          bit_4.5.0

Building an annotation database for metaseqR2

29 October 2024