BiocPkgTools 1.12.2
Bioconductor has a rich ecosystem of metadata around packages, usage, and build status. This package is a simple collection of functions to access that metadata from R in a tidy data format. The goal is to expose metadata for data mining and value-added functionality such as package searching, text mining, and analytics on packages.
Functionality includes access to build reports, download statistics, package details, and package dependency graphs.
The Bioconductor build reports are available online as HTML pages.
However, they are not very computable.
The biocBuildReport function does some heroic parsing of the HTML
to produce a tidy data.frame for further processing in R.
library(BiocPkgTools)
head(biocBuildReport())
## # A tibble: 6 × 11
## pkg author version git_last_commit git_last_commit_da… Deprecated
## <chr> <chr> <chr> <chr> <dttm> <lgl>
## 1 ABAEnrichment Steffi G… 1.24.0 5d20752 2021-10-26 12:22:01 FALSE
## 2 ABAEnrichment Steffi G… 1.24.0 5d20752 2021-10-26 12:22:01 FALSE
## 3 ABAEnrichment Steffi G… 1.24.0 5d20752 2021-10-26 12:22:01 FALSE
## 4 ABAEnrichment Steffi G… 1.24.0 5d20752 2021-10-26 12:22:01 FALSE
## 5 ABAEnrichment Steffi G… 1.24.0 5d20752 2021-10-26 12:22:01 FALSE
## 6 ABAEnrichment Steffi G… 1.24.0 5d20752 2021-10-26 12:22:01 FALSE
## # … with 5 more variables: PackageStatus <chr>, node <chr>, stage <chr>,
## # result <chr>, bioc_version <chr>
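As a sketch of the further processing this enables, the following counts non-OK results per package with dplyr. It uses a small hand-made tibble with a subset of the columns shown above; the package names, nodes, and results are invented for illustration and are not real build results.

```r
library(dplyr)

# Toy stand-in for biocBuildReport() output; all values are invented.
report <- tibble::tibble(
  pkg    = c("pkgA", "pkgA", "pkgB", "pkgB", "pkgC"),
  node   = c("node1", "node2", "node1", "node2", "node1"),
  stage  = c("buildsrc", "checksrc", "buildsrc", "checksrc", "buildsrc"),
  result = c("OK", "WARNING", "OK", "OK", "ERROR")
)

# Keep only problem results and count them per package and result type.
problems <- report %>%
  filter(result %in% c("ERROR", "WARNING")) %>%
  count(pkg, result, name = "n_problems")

problems
```

The same filter and count should apply to the real tibble, provided its result column uses the same status labels.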
Because developers may be interested in a quick view of their own
packages, there is a simple function, problemPage, to produce an HTML report of
the build status of packages matching a given author regex supplied to the
authorPattern argument.
The default is to report only “problem” build statuses (ERROR, WARNING).
problemPage(authorPattern = "V.*Carey")
In similar fashion, maintainers of packages that have many downstream packages that
depend on them may wish to check that a change they introduced hasn’t suddenly broken
a large number of these. You can use the dependsOn argument to produce a summary report
of those packages that “depend on” the given package.
problemPage(dependsOn = "limma")
When run in an interactive environment, the problemPage function
will open a browser window for user interaction. Note that if you want
to include all your package results, not just the broken ones, simply
specify includeOK = TRUE.
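To see how a regular expression such as "V.*Carey" selects authors, here is a minimal sketch on invented data. It assumes plain grepl() matching against an author string, which may differ from what problemPage does internally; the package names and author strings are hypothetical.

```r
# Hypothetical author strings keyed by (invented) package name.
authors <- c(
  pkgA = "Vincent Carey",
  pkgB = "V. J. Carey, A. N. Other",
  pkgC = "Someone Else"
)

# Keep the packages whose author string matches the pattern.
matched <- names(authors)[grepl("V.*Carey", authors)]
matched
```

Because the pattern is unanchored, it matches anywhere in the string, so a co-authored package is selected as well.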
Bioconductor supplies download stats for all packages. The biocDownloadStats
function grabs all available download stats for all Software, Annotation Data,
and Experiment Data packages. The results
are returned as a tidy data.frame for further analysis.
head(biocDownloadStats())
## # A tibble: 6 × 7
## pkgType Package Year Month Nb_of_distinct_IPs Nb_of_downloads Date
## <chr> <chr> <int> <chr> <int> <int> <date>
## 1 software ABarray 2021 Jan 60 127 2021-01-01
## 2 software ABarray 2021 Feb 51 139 2021-02-01
## 3 software ABarray 2021 Mar 77 146 2021-03-01
## 4 software ABarray 2021 Apr 75 145 2021-04-01
## 5 software ABarray 2021 May 73 122 2021-05-01
## 6 software ABarray 2021 Jun 56 101 2021-06-01
The download statistics reported are for all available versions of a package. There are no separate, publicly available statistics broken down by version.
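A typical next step is to aggregate the monthly counts. The sketch below sums downloads per package and year on a toy tibble that mirrors some of the columns above; the ABarray numbers are copied from the output shown, while toyPkg and its counts are invented.

```r
library(dplyr)

# Toy rows with a subset of the biocDownloadStats() columns.
stats <- tibble::tibble(
  Package         = c("ABarray", "ABarray", "ABarray", "toyPkg", "toyPkg"),
  Year            = c(2021L, 2021L, 2021L, 2021L, 2021L),
  Month           = c("Jan", "Feb", "Mar", "Jan", "Feb"),
  Nb_of_downloads = c(127L, 139L, 146L, 10L, 20L)
)

# Total downloads per package and year, largest first.
yearly <- stats %>%
  group_by(Package, Year) %>%
  summarise(total_downloads = sum(Nb_of_downloads), .groups = "drop") %>%
  arrange(desc(total_downloads))

yearly
```

On real output, it is worth checking first for any aggregate rows so that totals are not double counted.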
The majority of Bioconductor Software packages are also available through other channels
such as Anaconda, which also provides download statistics for packages installed from
its repositories. Access to these counts is provided by the anacondaDownloadStats
function:
head(anacondaDownloadStats())
## # A tibble: 6 × 7
## Package Year Month Nb_of_distinct_IPs Nb_of_downloads repo Date
## <chr> <chr> <chr> <int> <dbl> <chr> <date>
## 1 ABAData 2018 Apr NA 8 Anaconda 2018-04-01
## 2 ABAData 2018 Aug NA 5 Anaconda 2018-08-01
## 3 ABAData 2018 Dec NA 133 Anaconda 2018-12-01
## 4 ABAData 2018 Jul NA 6 Anaconda 2018-07-01
## 5 ABAData 2018 Jun NA 18 Anaconda 2018-06-01
## 6 ABAData 2018 Mar NA 13 Anaconda 2018-03-01
Note that Anaconda does not provide counts for distinct IP addresses, but this column is included for compatibility with the Bioconductor count tables.
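The shared column layout makes it possible to stack the two tables. One detail visible in the outputs above is that Year is an integer in the Bioconductor table but a character in the Anaconda table, so it needs converting first. The sketch below uses toy rows: the Anaconda count is copied from the output above, and the Bioconductor numbers are invented.

```r
library(dplyr)

# Toy Bioconductor row (invented counts), Year stored as integer.
bioc <- tibble::tibble(
  Package = "ABAData", Year = 2018L, Month = "Apr",
  Nb_of_distinct_IPs = 40L, Nb_of_downloads = 55L
)

# Toy Anaconda row, Year stored as character and no distinct-IP count.
conda <- tibble::tibble(
  Package = "ABAData", Year = "2018", Month = "Apr",
  Nb_of_distinct_IPs = NA_integer_, Nb_of_downloads = 8
)

# Harmonise the Year type, tag the source, and stack the rows.
combined <- bind_rows(
  mutate(bioc, repo = "Bioconductor"),
  mutate(conda, Year = as.integer(Year), repo = "Anaconda")
)

combined
```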
The R DESCRIPTION file contains a plethora of information regarding package
authors, dependencies, versions, etc. In a repository such as Bioconductor, these
details are available in bulk for all included packages. The biocPkgList function
returns a data.frame with a row for each package. A great deal of information is
available, as evidenced by the column names of the results.
bpi = biocPkgList()
colnames(bpi)
## [1] "Package" "Version" "Depends"
## [4] "Suggests" "License" "MD5sum"
## [7] "NeedsCompilation" "Title" "Description"
## [10] "biocViews" "Author" "Maintainer"
## [13] "git_url" "git_branch" "git_last_commit"
## [16] "git_last_commit_date" "Date/Publication" "source.ver"
## [19] "win.binary.ver" "mac.binary.ver" "vignettes"
## [22] "vignetteTitles" "hasREADME" "hasNEWS"
## [25] "hasINSTALL" "hasLICENSE" "Rfiles"
## [28] "dependencyCount" "Imports" "Enhances"
## [31] "dependsOnMe" "VignetteBuilder" "suggestsMe"
## [34] "LinkingTo" "Archs" "URL"
## [37] "SystemRequirements" "BugReports" "importsMe"
## [40] "PackageStatus" "Video" "linksToMe"
## [43] "License_restricts_use" "organism" "OS_type"
## [46] "License_is_FOSS"
Some of the variables are parsed to produce list columns.
head(bpi)
## # A tibble: 6 × 46
## Package Version Depends Suggests License MD5sum NeedsCompilation Title
## <chr> <chr> <list> <list> <chr> <chr> <chr> <chr>
## 1 a4 1.42.0 <chr [5]> <chr [6]> GPL-3 5addb5… no Auto…
## 2 a4Base 1.42.0 <chr [2]> <chr [4]> GPL-3 3a0b38… no Auto…
## 3 a4Classif 1.42.0 <chr [2]> <chr [4]> GPL-3 dbc971… no Auto…
## 4 a4Core 1.42.0 <chr [1]> <chr [2]> GPL-3 c24c68… no Auto…
## 5 a4Preproc 1.42.0 <chr [1]> <chr [4]> GPL-3 8bd01b… no Auto…
## 6 a4Reporting 1.42.0 <chr [1]> <chr [2]> GPL-3 fc5f30… no Auto…
## # … with 38 more variables: Description <chr>, biocViews <list>, Author <list>,
## # Maintainer <list>, git_url <chr>, git_branch <chr>, git_last_commit <chr>,
## # git_last_commit_date <chr>, Date/Publication <chr>, source.ver <chr>,
## # win.binary.ver <chr>, mac.binary.ver <chr>, vignettes <list>,
## # vignetteTitles <list>, hasREADME <chr>, hasNEWS <chr>, hasINSTALL <chr>,
## # hasLICENSE <chr>, Rfiles <list>, dependencyCount <chr>, Imports <list>,
## # Enhances <list>, dependsOnMe <list>, VignetteBuilder <chr>, …
As a simple example of how these columns can be used, extract
the importsMe column to find the packages that import the
GEOquery package.
library(dplyr)
bpi = biocPkgList()
bpi %>%
    filter(Package == "GEOquery") %>%
    pull(importsMe) %>%
    unlist()
## [1] "bigmelon" "BioPlex" "ChIPXpress"
## [4] "coexnet" "conclus" "crossmeta"
## [7] "DExMA" "EGAD" "GAPGOM"
## [10] "GEOexplorer" "MACPET" "minfi"
## [13] "MoonlightR" "phantasus" "recount"
## [16] "SRAdb" "BeadArrayUseCases" "GSE13015"
## [19] "geneExpressionFromGEO" "MetaIntegrator"
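List columns can also be summarised without unlisting them. The sketch below ranks packages by how many other packages import them, using a toy tibble with invented package names in place of the biocPkgList() result.

```r
library(dplyr)

# Toy stand-in for biocPkgList(); package names are invented.
pkgs <- tibble::tibble(
  Package   = c("pkgA", "pkgB", "pkgC"),
  importsMe = list(c("pkgB", "pkgC"), "pkgC", character(0))
)

# lengths() returns the number of elements in each list-column entry.
rev_imports <- pkgs %>%
  mutate(n_importsMe = lengths(importsMe)) %>%
  arrange(desc(n_importsMe))

rev_imports
```

If empty entries in the real table are stored as NA rather than as zero-length vectors, lengths() would count them as 1, so they would need filtering out first.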
For the end user of Bioconductor, an analysis often starts with finding a
package or set of packages that perform required tasks or are tailored
to a specific operation or data type. The biocExplore() function
implements an interactive bubble visualization with filtering based on
biocViews terms. Bubbles are sized based on download statistics. Tooltip
and detail-on-click capabilities are included. To start a local session:
biocExplore()
The Bioconductor ecosystem is built around the concept of interoperability
and dependencies. These interdependencies are available as part of the
biocPkgList() output. The BiocPkgTools package provides some convenience
functions to convert package dependencies to R graphs. A modular approach leads
to the following workflow.
1. Create a data.frame of dependencies using buildPkgDependencyDataFrame.
2. Create an igraph object from the dependency data frame using buildPkgDependencyIgraph.
3. Use native igraph functionality to perform arbitrary network operations.
   Convenience functions, inducedSubgraphByPkgs and subgraphByDegree, are available.
A dependency graph for all of Bioconductor is a starting place.
library(BiocPkgTools)
dep_df = buildPkgDependencyDataFrame()
g = buildPkgDependencyIgraph(dep_df)
g
## IGRAPH 202265b DN-- 3752 36416 --
## + attr: name (v/c), edgetype (e/c)
## + edges from 202265b (vertex names):
## [1] a4 ->a4Base a4 ->a4Preproc
## [3] a4 ->a4Classif a4 ->a4Core
## [5] a4 ->a4Reporting a4Base ->a4Preproc
## [7] a4Base ->a4Core a4Classif ->a4Core
## [9] a4Classif ->a4Preproc ABAEnrichment->R
## [11] abseqR ->R ABSSeq ->R
## [13] ABSSeq ->methods acde ->R
## [15] acde ->boot ACE ->R
## + ... omitted several edges
library(igraph)
head(V(g))
## + 6/3752 vertices, named, from 202265b:
## [1] a4 a4Base a4Classif ABAEnrichment abseqR
## [6] ABSSeq
head(E(g))
## + 6/36416 edges from 202265b (vertex names):
## [1] a4 ->a4Base a4 ->a4Preproc a4 ->a4Classif
## [4] a4 ->a4Core a4 ->a4Reporting a4Base->a4Preproc
See inducedSubgraphByPkgs and subgraphByDegree to produce
subgraphs based on a subset of packages.
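To make the idea of subsetting concrete, here is a minimal sketch using plain igraph functions on an invented dependency table. It only approximates the two kinds of subgraph; inducedSubgraphByPkgs and subgraphByDegree may be implemented differently, and the package names are hypothetical.

```r
library(igraph)

# Toy dependency table with the same three columns as the real one.
dep_df <- data.frame(
  Package    = c("pkgA", "pkgA", "pkgB", "pkgC", "pkgD"),
  dependency = c("pkgB", "pkgC", "pkgC", "pkgE", "pkgA"),
  edgetype   = c("Depends", "Imports", "Imports", "Suggests", "Imports"),
  stringsAsFactors = FALSE
)

# Directed graph: one edge per Package -> dependency row.
g <- graph_from_data_frame(dep_df, directed = TRUE)

# Subgraph induced by a chosen set of packages.
sub_pkgs <- induced_subgraph(g, c("pkgA", "pkgB", "pkgC"))

# Neighbourhood of order 1 around pkgA, ignoring edge direction.
ego_a <- make_ego_graph(g, order = 1, nodes = "pkgA", mode = "all")[[1]]

V(sub_pkgs)$name
V(ego_a)$name
```

Raising order widens the neighbourhood by one more step of dependencies in either direction.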
See the igraph documentation for more detail on graph analytics, setting vertex and edge attributes, and advanced subsetting.
The visNetwork package is a nice interactive visualization tool that implements graph plotting in a browser. It can be integrated into shiny applications. Interactive graphs can also be included in R Markdown documents (see vignette).
igraph_network = buildPkgDependencyIgraph(buildPkgDependencyDataFrame())
The full dependency graph is really not that informative to look at, though doing so is possible. A common use case is to visualize the graph of dependencies “centered” on a package of interest. In this case, I will focus on the GEOquery package.
igraph_geoquery_network = subgraphByDegree(igraph_network, "GEOquery")
The subgraphByDegree() function returns all nodes and connections within
degree of the named package; the default degree is 1.
The visNetwork package can plot igraph objects directly, but more flexibility
is offered by first converting the graph to visNetwork form.
library(visNetwork)
data <- toVisNetworkData(igraph_geoquery_network)
The next few code chunks highlight just a few examples of the visNetwork capabilities, starting with a basic plot.
visNetwork(nodes = data$nodes, edges = data$edges, height = "500px")
For fun, we can watch the graph stabilize during drawing, best viewed interactively.
visNetwork(nodes = data$nodes, edges = data$edges, height = "500px") %>%
    visPhysics(stabilization = FALSE)
Add arrows and colors to better capture dependencies.
data$edges$color <- "lightblue"
data$edges[data$edges$edgetype == "Imports", "color"] <- "red"
data$edges[data$edges$edgetype == "Depends", "color"] <- "green"
visNetwork(nodes = data$nodes, edges = data$edges, height = "500px") %>%
    visEdges(arrows = "from")
Add a legend.
ledges <- data.frame(color = c("green", "lightblue", "red"),
                     label = c("Depends", "Suggests", "Imports"),
                     arrows = c("from", "from", "from"))
visNetwork(nodes = data$nodes, edges = data$edges, height = "500px") %>%
    visEdges(arrows = "from") %>%
    visLegend(addEdges = ledges)
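The edge colouring used above can also be written as a lookup table, which keeps the mapping in one place. This is a sketch on an invented edge table in base R, not part of visNetwork; as in the code above, any edge type that is neither Depends nor Imports falls back to lightblue.

```r
# Toy edge table in the shape produced by toVisNetworkData(); rows are invented.
edges <- data.frame(
  from     = c("pkgA", "pkgA", "pkgB", "pkgC"),
  to       = c("pkgB", "pkgC", "pkgC", "pkgD"),
  edgetype = c("Depends", "Imports", "Suggests", "LinkingTo"),
  stringsAsFactors = FALSE
)

# Named vector acting as a lookup from edge type to colour.
edge_colors <- c(Depends = "green", Imports = "red")

# Look up each edge type; types missing from the lookup come back as NA.
edges$color <- unname(edge_colors[edges$edgetype])
edges$color[is.na(edges$color)] <- "lightblue"

edges
```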
[Work in progress]
The biocViews package is a small ontology of terms describing Bioconductor packages. This is a work-in-progress section, but here is a small example of plotting the biocViews graph.
library(biocViews)
data(biocViewsVocab)
biocViewsVocab
## A graphNEL graph with directed edges
## Number of Nodes = 494
## Number of Edges = 493
library(igraph)
g = igraph.from.graphNEL(biocViewsVocab)
library(visNetwork)
gv = toVisNetworkData(g)
visNetwork(gv$nodes, gv$edges, width = "100%") %>%
    visIgraphLayout(layout = "layout_as_tree", circular = TRUE) %>%
    visNodes(size = 20) %>%
    visPhysics(stabilization = FALSE)
The dependency burden of a package, namely the amount of functionality that
a given package is importing, is an important parameter to take into account
during package development. A package may break because one or more of its
dependencies have changed the part of the API our package is importing, or
because that part has itself broken. For this reason, it may be useful for package
developers to quantify the dependency burden of a given package. To do that,
we should first gather all dependency information using the function
buildPkgDependencyDataFrame(), setting the arguments to work with
packages in Bioconductor and CRAN and with dependencies categorised as Depends
or Imports, which are the ones installed by default for a given package.
library(BiocPkgTools)
depdf <- buildPkgDependencyDataFrame(repo = c("BioCsoft", "CRAN"),
                                     dependencies = c("Depends", "Imports"))
depdf
## # A tibble: 126,690 × 3
## Package dependency edgetype
## <chr> <chr> <chr>
## 1 a4 a4Base Depends
## 2 a4 a4Preproc Depends
## 3 a4 a4Classif Depends
## 4 a4 a4Core Depends
## 5 a4 a4Reporting Depends
## 6 a4Base a4Preproc Depends
## 7 a4Base a4Core Depends
## 8 a4Classif a4Core Depends
## 9 a4Classif a4Preproc Depends
## 10 ABAEnrichment R Depends
## # … with 126,680 more rows
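A table of this shape already allows a rough look at the burden of one package by comparing its direct dependencies with everything it pulls in recursively. The sketch below does this with igraph on an invented table; the package names are hypothetical and this is not how pkgDepMetrics() is necessarily implemented.

```r
library(igraph)

# Toy dependency table with the same columns as depdf; rows are invented.
toy_depdf <- data.frame(
  Package    = c("pkgA", "pkgA", "pkgB", "pkgC", "pkgD"),
  dependency = c("pkgB", "pkgC", "pkgD", "pkgD", "pkgE"),
  edgetype   = c("Imports", "Imports", "Depends", "Imports", "Imports"),
  stringsAsFactors = FALSE
)
g <- graph_from_data_frame(toy_depdf, directed = TRUE)

# Direct dependencies: rows where pkgA is the depending package.
direct <- toy_depdf$dependency[toy_depdf$Package == "pkgA"]

# Recursive dependencies: everything reachable by following edges outwards.
reachable <- subcomponent(g, "pkgA", mode = "out")
all_deps <- setdiff(as_ids(reachable), "pkgA")

length(direct)
length(all_deps)
```

Here two direct dependencies bring in four packages in total, which is the kind of gap the metrics in the next step try to quantify.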
Finally, we call the function pkgDepMetrics() to obtain different metrics
on the dependency burden of a package we want to analyze, in the case below,
the package BiocPkgTools itself:
pkgDepMetrics("BiocPkgTools", depdf)
## ImportedAndUsed Exported Usage DepOverlap DepGainIfExcluded
## utils 1 217 0.46 0.01 0
## rlang 4 476 0.84 0.02 0
## graph 1 116 0.86 0.07 0
## igraph 9 784 1.15 0.12 4
## RBGL 1 77 1.30 0.08 0
## htmltools 1 75 1.33 0.08 0
## xml2 1 66 1.52 0.01 0
## tidyr 1 62 1.61 0.24 1
## tools 2 118 1.69 0.01 0
## magrittr 1 41 2.44 0.01 0
## DT 1 39 2.56 0.23 6
## dplyr 9 285 3.16 0.23 0
## httr 3 91 3.30 0.10 0
## tidyselect 1 25 4.00 0.09 0
## tibble 2 44 4.55 0.17 0
## jsonlite 1 17 5.88 0.01 0
## htmlwidgets 1 14 7.14 0.12 0
## rvest 3 40 7.50 0.35 2
## gh 1 10 10.00 0.17 3
## stringr 5 49 10.20 0.08 0
## BiocManager 1 5 20.00 0.02 0
## BiocFileCache NA 29 NA 0.52 10
## biocViews NA 31 NA 0.17 6
## readr NA 114 NA 0.33 5
In the resulting table, rows correspond to dependencies and columns provide the following information:
- ImportedAndUsed: number of functionality calls imported and used in the package.
- Exported: number of functionality calls exported by the dependency.
- Usage: (ImportedAndUsed x 100) / Exported. This value provides an
  estimate of what percentage of the functionality of the dependency is
  actually used in the given package.
- DepOverlap: similarity between the dependency graph structure of the
  given package and that of the dependency in the corresponding row,
  estimated as the Jaccard index
  between the two sets of vertices of the corresponding graphs. Its value
  ranges between 0 and 1, where 0 indicates that no dependency is shared, while
  1 indicates that the given package and the corresponding dependency depend
  on an identical subset of packages.
- DepGainIfExcluded: the ‘dependency gain’ (decrease in the total number
  of dependencies) that would be obtained if this package was excluded
  from the list of direct dependencies.
The reported information is ordered by the Usage column to facilitate the
identification of dependencies for which the analyzed package is using a small
fraction of their functionality and which could therefore be easier to remove.
To aid in that decision, the column DepOverlap reports the overlap of the
dependency graph of each dependency with that of the analyzed package. Here
a value above, e.g., 0.5, could, albeit not necessarily, imply that removing
that dependency could substantially lighten the dependency burden of the analyzed
package.
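The two ratio-based columns are easy to reproduce by hand. The sketch below recomputes Usage for two rows of the table above and evaluates a Jaccard index on two invented sets of package names; how pkgDepMetrics() builds the vertex sets it compares is not shown here, so the sets are purely illustrative.

```r
# Usage: percentage of a dependency's exported functionality that is used.
# Counts are copied from the igraph and DT rows of the table above.
imported_and_used <- c(igraph = 9, DT = 1)
exported          <- c(igraph = 784, DT = 39)
usage <- round(imported_and_used * 100 / exported, 2)

# Jaccard index: size of the intersection divided by size of the union.
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# Invented dependency sets for a package and for one of its dependencies.
deps_pkg <- c("pkgB", "pkgC", "pkgD", "pkgE")
deps_dep <- c("pkgD", "pkgE", "pkgF")
overlap <- jaccard(deps_pkg, deps_dep)

usage
overlap
```

The recomputed Usage values agree with the 1.15 and 2.56 reported for igraph and DT.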
An NA value in the ImportedAndUsed column indicates that the function
pkgDepMetrics() could not identify which functionality calls in the analyzed
package are made to the dependency. This may happen because pkgDepMetrics()
has failed to identify the corresponding calls, as happens with imported
built-in constants such as DNA_BASES from Biostrings, or because, although the
given package is importing that dependency, none of its functionality is actually
being used. In the latter case, the dependency could be safely removed without
any further change in the analyzed package.
We can find out which functionality calls we are actually importing as follows:
imp <- pkgDepImports("BiocPkgTools")
imp %>% filter(pkg == "DT")
## # A tibble: 1 × 2
## pkg fun
## <chr> <chr>
## 1 DT datatable
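The same two-column table can be tallied to see how many distinct calls come from each dependency. The sketch below uses a toy tibble: the DT row is copied from the output above, and the other rows are invented for illustration.

```r
library(dplyr)

# Toy stand-in for pkgDepImports() output with its pkg and fun columns.
imp_toy <- tibble::tibble(
  pkg = c("DT", "dplyr", "dplyr", "httr"),
  fun = c("datatable", "filter", "mutate", "GET")
)

# Number of imported calls per dependency, most used first.
n_by_pkg <- imp_toy %>%
  count(pkg, name = "n_funs", sort = TRUE)

n_by_pkg
```

Dependencies with a single imported call, like DT here, are natural candidates to review first.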
sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] biocViews_1.62.1 visNetwork_2.1.0 igraph_1.2.7
## [4] dplyr_1.0.7 BiocPkgTools_1.12.2 htmlwidgets_1.5.4
## [7] knitr_1.36 BiocStyle_2.22.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.1.1 xfun_0.28 bslib_0.3.1
## [4] purrr_0.3.4 vctrs_0.3.8 generics_0.1.1
## [7] BiocFileCache_2.2.0 htmltools_0.5.2 stats4_4.1.1
## [10] yaml_2.2.1 blob_1.2.2 utf8_1.2.2
## [13] RBGL_1.70.0 XML_3.99-0.8 rlang_0.4.12
## [16] jquerylib_0.1.4 pillar_1.6.4 glue_1.4.2
## [19] DBI_1.1.1 rappdirs_0.3.3 dbplyr_2.1.1
## [22] bit64_4.0.5 BiocGenerics_0.40.0 lifecycle_1.0.1
## [25] stringr_1.4.0 rvest_1.0.2 memoise_2.0.0
## [28] evaluate_0.14 Biobase_2.54.0 tzdb_0.2.0
## [31] fastmap_1.1.0 curl_4.3.2 RUnit_0.4.32
## [34] fansi_0.5.0 Rcpp_1.0.7 readr_2.0.2
## [37] filelock_1.0.2 DT_0.19 BiocManager_1.30.16
## [40] cachem_1.0.6 graph_1.72.0 jsonlite_1.7.2
## [43] bit_4.0.4 hms_1.1.1 digest_0.6.28
## [46] stringi_1.7.5 bookdown_0.24 gh_1.3.0
## [49] cli_3.1.0 tools_4.1.1 bitops_1.0-7
## [52] magrittr_2.0.1 sass_0.4.0 RSQLite_2.2.8
## [55] RCurl_1.98-1.5 tibble_3.1.5 tidyr_1.1.4
## [58] crayon_1.4.2 pkgconfig_2.0.3 ellipsis_0.3.2
## [61] xml2_1.3.2 assertthat_0.2.1 rmarkdown_2.11
## [64] httr_1.4.2 R6_2.5.1 compiler_4.1.1