The countsimQC package provides a simple way to compare the characteristic features of a collection of count data sets (e.g., from RNA-seq). An important application is the evaluation of a synthetic count data set that has been generated using a real count data set as the underlying source of parameters. In that case it is often important to verify that the final synthetic data captures the main features of the original data set. However, the package can be used to create a visual overview of any collection of one or more count data sets.
In this vignette we will show how to generate a comparative report from a collection of two simulated data sets and the original, underlying real data set. First, we load the object containing the three data sets. The object is a named list, where each element is a DESeqDataSet object containing the count matrix, a sample information data frame and a model formula (necessary to calculate dispersions). For more information about the DESeqDataSet class, please see the DESeq2 Bioconductor package.
suppressPackageStartupMessages({
library(countsimQC)
library(DESeq2)
})
data(countsimExample)
countsimExample
## $Original
## class: DESeqDataSet
## dim: 10000 11
## metadata(1): version
## assays(1): counts
## rownames(10000): ENSMUSG00000000001.4 ENSMUSG00000000028.14 ...
## ENSMUSG00000048027.7 ENSMUSG00000048029.10
## rowData names(0):
## colnames(11): GSM1923445 GSM1923446 ... GSM1923578 GSM1923579
## colData names(2): group sample
##
## $Sim1
## class: DESeqDataSet
## dim: 10000 11
## metadata(1): version
## assays(1): counts
## rownames(10000): Gene1 Gene2 ... Gene9999 Gene10000
## rowData names(0):
## colnames(11): Cell1 Cell2 ... Cell88 Cell89
## colData names(4): Cell Batch Group ExpLibSize
##
## $Sim2
## class: DESeqDataSet
## dim: 10000 11
## metadata(1): version
## assays(1): counts
## rownames(10000): Gene1 Gene2 ... Gene9999 Gene10000
## rowData names(0):
## colnames(11): Cell1 Cell2 ... Cell88 Cell89
## colData names(3): Cell CellFac Group
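The three components mentioned above (count matrix, sample information and model formula) can be inspected for each list element. The following is a minimal sketch, assuming that countsimQC and DESeq2 are installed; the variable names havePkgs, nm and dds are illustrative only, and the accessors used are the standard ones from DESeq2 and SummarizedExperiment.

```r
# Check whether the packages used in this vignette are installed
havePkgs <- requireNamespace("countsimQC", quietly = TRUE) &&
  requireNamespace("DESeq2", quietly = TRUE)

if (havePkgs) {
  data(countsimExample, package = "countsimQC")
  for (nm in names(countsimExample)) {
    dds <- countsimExample[[nm]]
    # Dimensions of the count matrix
    cat(nm, ": ", nrow(dds), " features x ", ncol(dds), " samples\n", sep = "")
    # Columns of the sample information data frame
    cat("  sample annotation columns: ",
        paste(colnames(SummarizedExperiment::colData(dds)), collapse = ", "),
        "\n", sep = "")
    # Model formula used for the dispersion calculations
    cat("  design: ", deparse(DESeq2::design(dds)), "\n", sep = "")
  }
}
```

If the packages are not available, the block does nothing.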
Next, we generate the report using the countsimQCReport() function. Depending on the level of detail and the type of information required for the final report, this function can be run in different “modes”:
- With calculateStatistics = FALSE, only plots will be generated. This is the fastest way of running countsimQCReport(), and in many cases it generates enough information for the user to make a visual evaluation of the count data set(s).
- With calculateStatistics = TRUE and permutationPvalues = FALSE, some quantitative pairwise comparisons between data sets will be performed. In particular, the Kolmogorov-Smirnov test and the Wald-Wolfowitz runs test will be used to compare distributions, and additional statistics will be calculated to evaluate how similar the evaluated aspects are between pairs of data sets.
- With calculateStatistics = TRUE and permutationPvalues = TRUE (and the requested number of permutations given via the nPermutations argument), permutation of data set labels will be used to evaluate the significance of the statistics calculated in the previous mode. Naturally, this increases the run time of the analysis considerably.
Here, for the sake of speed, we calculate statistics for a small subset of the observations (subsampleSize = 25) and refrain from calculating permutation p-values.
tempDir <- tempdir()
countsimQCReport(ddsList = countsimExample, outputFile = "countsim_report.html",
                 outputDir = tempDir, outputFormat = "html_document",
                 showCode = FALSE, forceOverwrite = TRUE,
                 savePlots = TRUE, description = "This is my test report.",
                 maxNForCorr = 25, maxNForDisp = Inf,
                 calculateStatistics = TRUE, subsampleSize = 25,
                 kfrac = 0.01, kmin = 5,
                 permutationPvalues = FALSE, nPermutations = NULL)
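To make the quantitative comparisons described for the different modes more concrete, the following base R sketch applies a two-sample Kolmogorov-Smirnov test and a hand-written Wald-Wolfowitz runs test to two simulated vectors, and then estimates a permutation p-value for the Kolmogorov-Smirnov statistic by shuffling the data set labels. This is not the internal code of countsimQC: the vectors x and y, the helper runsTest() and the value of nPerm are assumptions made purely for illustration.

```r
set.seed(42)
# Two hypothetical samples of a per-sample summary (e.g., log library sizes)
x <- rnorm(30, mean = 10, sd = 1)
y <- rnorm(30, mean = 10.5, sd = 1)

# Two-sample Kolmogorov-Smirnov test from the stats package
ks <- ks.test(x, y)

# Two-sample Wald-Wolfowitz runs test (normal approximation):
# sort the pooled values, count runs of identical labels, and compare the
# count with the number expected if the labels were randomly interleaved.
runsTest <- function(x, y) {
  labels <- c(rep("x", length(x)), rep("y", length(y)))[order(c(x, y))]
  r <- length(rle(labels)$lengths)
  n1 <- length(x)
  n2 <- length(y)
  n <- n1 + n2
  mu <- 2 * n1 * n2 / n + 1
  sigma2 <- 2 * n1 * n2 * (2 * n1 * n2 - n) / (n^2 * (n - 1))
  z <- (r - mu) / sqrt(sigma2)
  # Few runs indicate different distributions, so use the lower tail
  list(runs = r, statistic = z, p.value = pnorm(z))
}
ww <- runsTest(x, y)

# Permutation p-value for the KS statistic: reassign the labels at random
# and count how often the permuted statistic reaches the observed one.
pooled <- c(x, y)
nPerm <- 200
permStats <- replicate(nPerm, {
  idx <- sample(length(pooled), length(x))
  unname(ks.test(pooled[idx], pooled[-idx])$statistic)
})
permP <- (sum(permStats >= unname(ks$statistic)) + 1) / (nPerm + 1)
```

The permutation loop repeats the test nPerm times, which illustrates why enabling permutation p-values increases the run time considerably.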
The countsimQCReport() function can generate either an HTML file (by setting outputFormat = "html_document" or outputFormat = NULL) or a PDF file (by setting outputFormat = "pdf_document"). The description argument can be used to provide a more extensive description of the data set(s) that are included in the report.
If the argument savePlots is set to TRUE, an .rds file containing the individual ggplot objects will be generated. These objects can be used to fine-tune the visualizations if desired. Note, however, that the .rds file can become large if the number of data sets is large, or if the individual data sets have many samples or features. The convenience function generateIndividualPlots() can be used to quickly generate individual figures for all plots included in the report, using a variety of devices. For example, it can write each plot to a separate PDF file.
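A call along the following lines could be used to export the saved plots as PDF files. This is a sketch under stated assumptions: the argument names (ggplotsRds, device, nDatasets, outputDir) and the name of the .rds file (the report file name without extension, followed by _ggplots.rds) are taken from my reading of the package documentation and should be checked against ?generateIndividualPlots; the output directory name figures is arbitrary.

```r
# Assumed location of the saved ggplot objects
tempDir <- tempdir()
rdsFile <- file.path(tempDir, "countsim_report_ggplots.rds")
figDir <- file.path(tempDir, "figures")

# Only run if the package is installed and the report has been generated
if (requireNamespace("countsimQC", quietly = TRUE) && file.exists(rdsFile)) {
  dir.create(figDir, showWarnings = FALSE)
  countsimQC::generateIndividualPlots(
    ggplotsRds = rdsFile,  # file written when savePlots = TRUE
    device = "pdf",        # graphics device / file format
    nDatasets = 3,         # number of data sets included in the report
    outputDir = figDir
  )
}
```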
In the example above, all data sets were provided as DESeqDataSet objects. The advantage of this is that it allows the specification of the experimental design, which is used in the dispersion calculations. countsimQC also allows a data set to be provided as either a data.frame or a matrix. However, in these situations, it will be assumed that all samples are replicates (i.e., a design ~1). An example is provided in the countsimExample_dfmat data set, provided with the package.
data(countsimExample_dfmat)
names(countsimExample_dfmat)
## [1] "Original" "Sim1" "Sim2"
lapply(countsimExample_dfmat, class)
## $Original
## [1] "DESeqDataSet"
## attr(,"package")
## [1] "DESeq2"
##
## $Sim1
## [1] "matrix"
##
## $Sim2
## [1] "data.frame"
tempDir <- tempdir()
countsimQCReport(ddsList = countsimExample_dfmat,
                 outputFile = "countsim_report_dfmat.html",
                 outputDir = tempDir, outputFormat = "html_document",
                 showCode = FALSE, forceOverwrite = TRUE,
                 savePlots = TRUE, description = "This is my test report.",
                 maxNForCorr = 25, maxNForDisp = Inf,
                 calculateStatistics = TRUE, subsampleSize = 25,
                 kfrac = 0.01, kmin = 5,
                 permutationPvalues = FALSE, nPermutations = NULL)
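A mixed input list of this kind can also be assembled from one's own data. The sketch below builds a small, purely hypothetical count matrix, stores it as a matrix and as a data.frame, and, if DESeq2 is installed, also as a DESeqDataSet with an explicit design. The object names (counts, sampleInfo, myList) and the simulated values are assumptions for illustration; the final check only mirrors the three input types described above and is not part of countsimQC.

```r
set.seed(1)
# Hypothetical count matrix: 50 features x 6 samples
counts <- matrix(rnbinom(300, mu = 100, size = 2), nrow = 50,
                 dimnames = list(paste0("Gene", 1:50), paste0("Sample", 1:6)))

# Matrix and data.frame inputs: all samples are treated as replicates (~1)
myList <- list(AsMatrix = counts, AsDataFrame = as.data.frame(counts))

# A DESeqDataSet additionally carries the experimental design
if (requireNamespace("DESeq2", quietly = TRUE)) {
  sampleInfo <- data.frame(group = factor(rep(c("A", "B"), each = 3)),
                           row.names = colnames(counts))
  myList$WithDesign <- DESeq2::DESeqDataSetFromMatrix(
    countData = counts, colData = sampleInfo, design = ~ group)
}

# Check that the list is named and only contains the supported types
okClass <- vapply(myList, function(x) {
  is.matrix(x) || is.data.frame(x) || methods::is(x, "DESeqDataSet")
}, logical(1))
stopifnot(!is.null(names(myList)), all(okClass))
```

Such a list could then be passed as the ddsList argument in the same way as countsimExample_dfmat.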
sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.9-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.9-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] DESeq2_1.24.0 SummarizedExperiment_1.14.0
## [3] DelayedArray_0.10.0 BiocParallel_1.18.0
## [5] matrixStats_0.54.0 Biobase_2.44.0
## [7] GenomicRanges_1.36.0 GenomeInfoDb_1.20.0
## [9] IRanges_2.18.0 S4Vectors_0.22.0
## [11] BiocGenerics_0.30.0 countsimQC_1.2.0
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-6 bit64_0.9-7 RColorBrewer_1.1-2
## [4] tools_3.6.0 backports_1.1.4 R6_2.4.0
## [7] DT_0.5 rpart_4.1-15 Hmisc_4.2-0
## [10] DBI_1.0.0 lazyeval_0.2.2 colorspace_1.4-1
## [13] nnet_7.3-12 withr_2.1.2 tidyselect_0.2.5
## [16] gridExtra_2.3 bit_1.1-14 compiler_3.6.0
## [19] htmlTable_1.13.1 randtests_1.0 labeling_0.3
## [22] caTools_1.17.1.2 scales_1.0.0 checkmate_1.9.1
## [25] genefilter_1.66.0 stringr_1.4.0 digest_0.6.18
## [28] foreign_0.8-71 rmarkdown_1.12 XVector_0.24.0
## [31] base64enc_0.1-3 pkgconfig_2.0.2 htmltools_0.3.6
## [34] limma_3.40.0 htmlwidgets_1.3 rlang_0.3.4
## [37] rstudioapi_0.10 RSQLite_2.1.1 shiny_1.3.2
## [40] jsonlite_1.6 crosstalk_1.0.0 acepack_1.4.1
## [43] dplyr_0.8.0.1 RCurl_1.95-4.12 magrittr_1.5
## [46] GenomeInfoDbData_1.2.1 Formula_1.2-3 Matrix_1.2-17
## [49] Rcpp_1.0.1 munsell_0.5.0 stringi_1.4.3
## [52] yaml_2.2.0 edgeR_3.26.0 zlibbioc_1.30.0
## [55] plyr_1.8.4 grid_3.6.0 blob_1.1.1
## [58] promises_1.0.1 crayon_1.3.4 lattice_0.20-38
## [61] splines_3.6.0 annotate_1.62.0 locfit_1.5-9.1
## [64] knitr_1.22 pillar_1.3.1 geneplotter_1.62.0
## [67] XML_3.98-1.19 glue_1.3.1 evaluate_0.13
## [70] latticeExtra_0.6-28 data.table_1.12.2 httpuv_1.5.1
## [73] gtable_0.3.0 purrr_0.3.2 tidyr_0.8.3
## [76] assertthat_0.2.1 ggplot2_3.1.1 xfun_0.6
## [79] mime_0.6 xtable_1.8-4 later_0.8.0
## [82] survival_2.44-1.1 tibble_2.1.1 AnnotationDbi_1.46.0
## [85] memoise_1.1.0 cluster_2.0.9