InPAS 2.2.0
Alternative polyadenylation (APA) is one of the most important post-transcriptional regulation mechanisms which is prevalent in Eukaryotes. Like alternative splicing, APA can increase transcriptome diversity. In addition, it defines 3’ UTR and results in altered expression of the gene. It is a tightly controlled process and mis-regulation of APA can affect many biological processed, such as uncontrolled cell cycle and growth. Although several high throughput sequencing methods have been developed, there are still limited data dedicated to identifying APA events.
However, massive RNA-seq datasets, which were originally created to quantify genome-wide gene expression, are available in public databases such as GEO and TCGA. These RNA-seq datasets also contain information of genome-wide APA. Thus, we developed the InPAS package for identifying APA from the conventional RNA-seq data.
The major procedures in InPAS workflow are as follows:
In addition, the InPAS package also provide functions to perform quality control over RNA-seq data coverage, visualize differential usage of proximal and distal CP sites for genes of interest, and prepare essential files for gene set enrichment analysis (GSEA) to reveal biological insights from genes with alternative CP sites.
First, load the required packages, including InPAS, and species-specific genome and genome annotation database: BSgenome, TxDb and EnsDb.
suppressPackageStartupMessages({
library(InPAS)
library(BSgenome.Mmusculus.UCSC.mm10)
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
library(EnsDb.Hsapiens.v86)
library(EnsDb.Mmusculus.v79)
library(cleanUpdTSeq)
library("RSQLite")
library("DBI")
})
3’ UTR annotation, including start and end coordinates, and strand information of 3’ UTRs, last CDS and the gaps between 3’ extremities of 3’ UTRs and immediate downstream exons, is extracted using the function extract_UTR3Anno
from genome annotation databases: a TxDb database and an Ensembldb database for a species of interest. For demonstration, the following snippet of R scripts shows how to extract 3’ UTR annotation from a abridged TxDb for a human reference genome (hg19). In reality, users should use a TxDb for the most reliable genome annotation of the PRIMARY reference genome assembly (NOT including the alternative patches) used for RNA-seq read alignment. If a TxDb is not available for the species of interest, users can build one using the function makeTxDbFromUCSC, makeTxDbFromBiomart, makeTxDbFromEnsembl, or makeTxDbFromGFF from the GenomicFeatures
package, depending on the sources of the genome annotation file.
samplefile <- system.file("extdata",
"hg19_knownGene_sample.sqlite",
package="GenomicFeatures")
TxDb <- loadDb(samplefile)
utr3_anno <- extract_UTR3Anno(TxDb = TxDb,
edb = EnsDb.Hsapiens.v86,
removeScaffolds = TRUE)
## Warning: Unable to map 3 of 42 requested IDs.
head(utr3_anno$chr1)
## GRanges object with 6 ranges and 8 metadata columns:
## seqnames ranges
## <Rle> <IRanges>
## chr1:lastutr3:uc001bum.2_5|IQCC|utr3 chr1 32673684-32674288
## chr1:lastutr3:uc001fbq.3_3|S100A9|utr3 chr1 153333315-153333503
## chr1:lastutr3:uc031pqa.1_13|100129405|utr3 chr1 155719929-155720673
## chr1:lastutr3:uc001gde.2_2|LRRC52|utr3 chr1 165533062-165533185
## chr1:lastutr3:uc001hfg.3_15|PFKFB2|utr3 chr1 207245717-207251162
## chr1:lastutr3:uc001hfh.3_15|PFKFB2|utr3 chr1 207252365-207254368
## strand | exon_rank transcript
## <Rle> | <integer> <character>
## chr1:lastutr3:uc001bum.2_5|IQCC|utr3 + | 5 uc001bum.2
## chr1:lastutr3:uc001fbq.3_3|S100A9|utr3 + | 3 uc001fbq.3
## chr1:lastutr3:uc031pqa.1_13|100129405|utr3 + | 13 uc031pqa.1
## chr1:lastutr3:uc001gde.2_2|LRRC52|utr3 + | 2 uc001gde.2
## chr1:lastutr3:uc001hfg.3_15|PFKFB2|utr3 + | 15 uc001hfg.3
## chr1:lastutr3:uc001hfh.3_15|PFKFB2|utr3 + | 15 uc001hfh.3
## feature gene
## <character> <character>
## chr1:lastutr3:uc001bum.2_5|IQCC|utr3 utr3 55721
## chr1:lastutr3:uc001fbq.3_3|S100A9|utr3 utr3 6280
## chr1:lastutr3:uc031pqa.1_13|100129405|utr3 utr3 100129405
## chr1:lastutr3:uc001gde.2_2|LRRC52|utr3 utr3 440699
## chr1:lastutr3:uc001hfg.3_15|PFKFB2|utr3 utr3 5208
## chr1:lastutr3:uc001hfh.3_15|PFKFB2|utr3 utr3 5208
## exon symbol
## <character> <character>
## chr1:lastutr3:uc001bum.2_5|IQCC|utr3 chr1:lastutr3:uc001b.. IQCC
## chr1:lastutr3:uc001fbq.3_3|S100A9|utr3 chr1:lastutr3:uc001f.. S100A9
## chr1:lastutr3:uc031pqa.1_13|100129405|utr3 chr1:lastutr3:uc031p.. 100129405
## chr1:lastutr3:uc001gde.2_2|LRRC52|utr3 chr1:lastutr3:uc001g.. LRRC52
## chr1:lastutr3:uc001hfg.3_15|PFKFB2|utr3 chr1:lastutr3:uc001h.. PFKFB2
## chr1:lastutr3:uc001hfh.3_15|PFKFB2|utr3 chr1:lastutr3:uc001h.. PFKFB2
## annotatedProximalCP truncated
## <character> <logical>
## chr1:lastutr3:uc001bum.2_5|IQCC|utr3 unknown FALSE
## chr1:lastutr3:uc001fbq.3_3|S100A9|utr3 unknown FALSE
## chr1:lastutr3:uc031pqa.1_13|100129405|utr3 proximalCP_155720479 FALSE
## chr1:lastutr3:uc001gde.2_2|LRRC52|utr3 unknown FALSE
## chr1:lastutr3:uc001hfg.3_15|PFKFB2|utr3 unknown FALSE
## chr1:lastutr3:uc001hfh.3_15|PFKFB2|utr3 unknown FALSE
## -------
## seqinfo: 24 sequences from an unspecified genome
This vignette will use the prepared 3’ UTR annotation for the mouse reference genome mm10 for subsequent demonstration
# load R object: utr3.mm10
data(utr3.mm10)
## convert the GRanges into GRangesList for the 3' UTR annotation
utr3.mm10 <- split(utr3.mm10, seqnames(utr3.mm10))
Seven tables are created in the database. Table “metadata” stores the metadata, including information for tag (sample name), condition (experimental treatment group), bedgraph_file (paths to BEDGraph files), and depth (whole genome coverage depth) which is initially set to zeros and later updated during analysis. Tables “sample_coverage”, “chromosome_coverage”, “total_coverage”, “utr3_total_coverage”, “CPsites”, and “utr3cds_coverage” store names of intermediate files and the chromosome and tag (sample name) relevant to the files.
data_dir <- system.file("extdata", package = "InPAS")
bedgraphs <- c(file.path(data_dir, "Baf3.extract.bedgraph"),
file.path(data_dir, "UM15.extract.bedgraph"))
hugeData <- FALSE
genome <- BSgenome.Mmusculus.UCSC.mm10
tags <- c("Baf3", "UM15")
metadata <- data.frame(tag = tags,
condition = c("Baf3", "UM15"),
bedgraph_file = bedgraphs)
## In reality, don't use a temporary directory for your analysis. Instead, use a
## persistent directory to save your analysis output.
outdir = tempdir()
write.table(metadata, file =file.path(outdir, "metadata.txt"),
sep = "\t", quote = FALSE, row.names = FALSE)
sqlite_db <- setup_sqlitedb(metadata = file.path(outdir,
"metadata.txt"),
outdir)
## check the database
db_conn <- dbConnect(drv = RSQLite::SQLite(), dbname = sqlite_db)
dbListTables(db_conn)
## [1] "CPsites" "chromosome_coverage" "metadata"
## [4] "sample_coverage" "total_coverage" "utr3_total_coverage"
## [7] "utr3cds_coverage"
dbReadTable(db_conn, "metadata")
## tag condition
## 1 Baf3 Baf3
## 2 UM15 UM15
## bedgraph_file depth
## 1 /tmp/RtmpjXnMiT/Rinst113e22f838bd6/InPAS/extdata/Baf3.extract.bedgraph 0
## 2 /tmp/RtmpjXnMiT/Rinst113e22f838bd6/InPAS/extdata/UM15.extract.bedgraph 0
dbDisconnect(db_conn)
Before this step, genome coverage in the BEDGraph format should be prepared from BAM files resulted from RNA-seq data alignment using the genomecov
command in the BEDTools suite. BAM files can be filtered to remove multi-mapping alignments, alignments with low mapping quality and so on. Commands for reference are as follows:
## for single end RNA-seq data aligned with STAR
## -q 255, unique mapping
samtools view -bu -h -q 255 /path/to/XXX.SE.bam | \
bedtools genomecov -ibam - -bga -split > XXX.SE.uniq.bedgraph
## for paired-end RNA-seq data aligned with STAR
samtools view -bu -h -q 255 /path/to/XXX.PE.bam | \
bedtools genomecov -ibam - -bga -split > XXX.PE.uniq.bedgraph
The genome coverage data in the BEDGraph formatis converted into R objects of Rle-class using the get_ssRleCov
function for each chromosome of each sample. Rle objects for each individual chromosome are save to outdir
. The filename, tag (sample name), and chromosome name are save to Table “sample_coverage”. Subsequently, chromosome-specific Rle objects for all samples are assemble together into a two-level list of Rle objects, with level 1 being the chromosome name and level 2 being Rle for each tag (sample name). Notably, the sample BEDGraph files used here only contain coverage data for “chr6” of the mouse reference genome mm10.
coverage <- list()
for (i in seq_along(bedgraphs)){
coverage[[tags[i]]] <- get_ssRleCov(bedgraph = bedgraphs[i],
tag = tags[i],
genome = genome,
sqlite_db = sqlite_db,
outdir = outdir,
removeScaffolds = TRUE,
BPPARAM = NULL)
}
coverage_files <- assemble_allCov(sqlite_db,
outdir,
genome,
removeScaffolds = FALSE)
At this point, users can check the data quality in terms of coverage for all and expressed genes and 3’ UTRs using run_coverageQC
. This function output summarized coverage metrics: gene.coverage.rate, expressed.gene.coverage.rate, UTR3.coverage.rate, and UTR3.expressed.gene.subset.coverage.rate. The coverage rate of quality data should be greater than 0.75 for 3’ UTRs of expressed genes.
edb <- EnsDb.Mmusculus.v79
TxDb <- TxDb.Mmusculus.UCSC.mm10.knownGene
run_coverageQC(sqlite_db, TxDb, edb, genome,
removeScaffolds = TRUE,
which = GRanges("chr6",
ranges = IRanges(98013000, 140678000)))
## strand information will be ignored.
## Warning: Unable to map 6025 of 24568 requested IDs.
## gene.coverage.rate expressed.gene.coverage.rate UTR3.coverage.rate
## Baf3 0.003463505 0.5778441 0.01419771
## UM15 0.003428528 0.5719748 0.01405159
## UTR3.expressed.gene.subset.coverage.rate
## Baf3 0.8035821
## UM15 0.7953112
depth weight, Z-score cutoff thresholds, and total coverage along 3’ UTRs merged across biological replicates within each condition (huge data) or individual sample (non-huge data) are returned by the setup_CPsSearch
function. Potential novel CP sites are identified for each chromosome using the search_CPs
function. These potential CP sites can be filtered and/or adjusted using the Naive Bayes classifier provided by cleanUpdTseq and/or by using the polyadenylation scores by simply matching the position-weight matrix (PWM) for the hexamer polyadenylation signal (AAUAAA and the like).
## PWM for hexamer polyadenylation signal (AAUAAA and the like) from the InPAS
## package
load(system.file("extdata", "polyA.rda", package = "InPAS"))
## load the Naive Bayes classifier model for classify CP sites from the
## cleanUpdTseq package
data(classifier)
prepared_data <- setup_CPsSearch(sqlite_db,
genome,
utr3 = utr3.mm10,
background = "10K",
TxDb = TxDb,
removeScaffolds = TRUE,
BPPARAM = NULL,
hugeData = TRUE,
outdir=outdir,
silence = TRUE)
cpsites <- search_CPs(seqname = "chr6",
sqlite_db = sqlite_db,
utr3 = utr3.mm10,
background = prepared_data$background,
z2s = prepared_data$z2s,
depth.weight = prepared_data$depth.weight,
genome = genome,
MINSIZE = 10,
window_size = 100,
search_point_START = 50,
search_point_END = NA,
cutStart = 10,
cutEnd = 0,
adjust_distal_polyA_end = TRUE,
coverage_threshold = 5,
long_coverage_threshold = 2,
PolyA_PWM = pwm,
classifier = classifier,
classifier_cutoff = 0.8,
shift_range = 100,
step = 5,
two_way = FALSE,
hugeData = TRUE,
outdir = outdir,
silence = TRUE)
Estimate usage of proximal and distal CP sites based on read coverage along the short and long 3’ UTRs
utr3_cds_cov <- get_regionCov(chr.utr3 = utr3.mm10[["chr6"]],
sqlite_db,
outdir,
BPPARAM = NULL,
phmm = FALSE)
eSet <- get_UTR3eSet(sqlite_db,
normalize ="none",
singleSample = FALSE)
InPAS provides the function test_dPDUI
to identify differential usage of proximal and distal CP sites between different conditions leveraging different statistical models according to the experimental design. InPAS offers statistical methods for single sample differential PDUI analysis, and single group analysis. Additionally, InPAS provides Fisher exact test for two-group unreplicated design, and empirical Bayes linear model leveraging the limma package for more complex design. The test results can be further filtered using the filter_testOut
function based on the fraction samples within each condition with coverage data for the identified differential PDUI events, and/or cutoffs of nominal p-values, adjusted p-values or log2 (fold change).
test_out <- test_dPDUI(eset = eSet,
method = "fisher.exact",
normalize = "none",
sqlite_db = sqlite_db)
filter_out <- filter_testOut(res = test_out,
gp1 = "Baf3",
gp2 = "UM15",
background_coverage_threshold = 2,
P.Value_cutoff = 0.05,
adj.P.Val_cutoff = 0.05,
dPDUI_cutoff = 0.3,
PDUI_logFC_cutoff = 0.59)
InPAS package also provide functions, get_usage4plot
, plot_utr3Usage
, and setup_GSEA
, to visualize differential usage of proximal and distal CP sites for genes of interest, and prepare essential files for gene set enrichment analysis (GSEA) to reveal biological insights from genes with alternative CP sites.
## Visualize dPDUI events
gr <- GRanges("chr6", IRanges(128846245, 128850081), strand = "-")
names(gr) <- "128846245-128850081"
data4plot <- get_usage4plot(gr,
proximalSites = 128849193,
sqlite_db,
hugeData = TRUE)
plot_utr3Usage(usage_data = data4plot,
vline_color = "purple",
vline_type = "dashed")
## prepare a rank file for GSEA
setup_GSEA(eset = test_out,
groupList= list(Baf3 = "Baf3", UM15 ="UM15"),
outdir = outdir,
preranked = TRUE,
rankBy = "logFC",
rnkFilename = "InPAS.rnk")
sessionInfo()
R version 4.1.1 (2021-08-10) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.3 LTS
Matrix products: default BLAS: /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB LC_COLLATE=C
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] DBI_1.1.1
[2] RSQLite_2.2.8
[3] cleanUpdTSeq_1.32.0
[4] BSgenome.Drerio.UCSC.danRer7_1.4.0
[5] EnsDb.Mmusculus.v79_2.99.0
[6] EnsDb.Hsapiens.v86_2.99.0
[7] ensembldb_2.18.0
[8] AnnotationFilter_1.18.0
[9] TxDb.Mmusculus.UCSC.mm10.knownGene_3.10.0
[10] GenomicFeatures_1.46.0
[11] AnnotationDbi_1.56.0
[12] BSgenome.Mmusculus.UCSC.mm10_1.4.3
[13] BSgenome_1.62.0
[14] rtracklayer_1.54.0
[15] Biostrings_2.62.0
[16] XVector_0.34.0
[17] InPAS_2.2.0
[18] GenomicRanges_1.46.0
[19] GenomeInfoDb_1.30.0
[20] IRanges_2.28.0
[21] S4Vectors_0.32.0
[22] Biobase_2.54.0
[23] BiocGenerics_0.40.0
[24] BiocStyle_2.22.0
loaded via a namespace (and not attached):
[1] colorspace_2.0-2 seqinr_4.2-8
[3] rjson_0.2.20 ellipsis_0.3.2
[5] class_7.3-19 depmixS4_1.5-0
[7] proxy_0.4-26 farver_2.1.0
[9] bit64_4.0.5 fansi_0.5.0
[11] xml2_1.3.2 cachem_1.0.6
[13] knitr_1.36 ade4_1.7-18
[15] jsonlite_1.7.2 Rsamtools_2.10.0
[17] dbplyr_2.1.1 png_0.1-7
[19] BiocManager_1.30.16 readr_2.0.2
[21] compiler_4.1.1 httr_1.4.2
[23] lazyeval_0.2.2 assertthat_0.2.1
[25] Matrix_1.3-4 fastmap_1.1.0
[27] limma_3.50.0 htmltools_0.5.2
[29] prettyunits_1.1.1 tools_4.1.1
[31] gtable_0.3.0 glue_1.4.2
[33] GenomeInfoDbData_1.2.7 reshape2_1.4.4
[35] dplyr_1.0.7 rappdirs_0.3.3
[37] Rcpp_1.0.7 jquerylib_0.1.4
[39] vctrs_0.3.8 preprocessCore_1.56.0
[41] nlme_3.1-153 xfun_0.27
[43] stringr_1.4.0 plyranges_1.14.0
[45] lifecycle_1.0.1 restfulr_0.0.13
[47] XML_3.99-0.8 zlibbioc_1.40.0
[49] MASS_7.3-54 scales_1.1.1
[51] vroom_1.5.5 ProtGenerics_1.26.0
[53] hms_1.1.1 MatrixGenerics_1.6.0
[55] parallel_4.1.1 SummarizedExperiment_1.24.0
[57] yaml_2.2.1 curl_4.3.2
[59] memoise_2.0.0 ggplot2_3.3.5
[61] sass_0.4.0 biomaRt_2.50.0
[63] stringi_1.7.5 highr_0.9
[65] BiocIO_1.4.0 e1071_1.7-9
[67] filelock_1.0.2 BiocParallel_1.28.0
[69] truncnorm_1.0-8 rlang_0.4.12
[71] pkgconfig_2.0.3 matrixStats_0.61.0
[73] bitops_1.0-7 Rsolnp_1.16
[75] evaluate_0.14 lattice_0.20-45
[77] purrr_0.3.4 labeling_0.4.2
[79] GenomicAlignments_1.30.0 bit_4.0.4
[81] tidyselect_1.1.1 plyr_1.8.6
[83] magrittr_2.0.1 bookdown_0.24
[85] R6_2.5.1 magick_2.7.3
[87] generics_0.1.1 DelayedArray_0.20.0
[89] pillar_1.6.4 KEGGREST_1.34.0
[91] RCurl_1.98-1.5 nnet_7.3-16
[93] tibble_3.1.5 crayon_1.4.1
[95] utf8_1.2.2 BiocFileCache_2.2.0
[97] tzdb_0.1.2 rmarkdown_2.11
[99] progress_1.2.2 grid_4.1.1
[101] blob_1.2.2 digest_0.6.28
[103] munsell_0.5.0 bslib_0.3.1
1. Sheppard, S., Lawson, N. D. & Zhu, L. J. Accurate identification of polyadenylation sites from 3′ end deep sequencing using a naive bayes classifier. Bioinformatics 29, 2564–2571 (2013).