BiocSet 1.8.1
BiocSet
is a package that represents element sets in a tibble format with the
BiocSet
class. Element sets are read in and converted into a tibble format.
From here, typical dplyr
operations can be performed on the element set.
BiocSet
also provides functionality for mapping different ID types and
providing reference urls for elements and sets.
Install the most recent version from Bioconductor:
if(!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("BiocSet")
The development version is also available for install from GitHub:
BiocManager::install("Kayla-Morrell/BiocSet")
Then load BiocSet
:
library(BiocSet)
BiocSet
can create a BiocSet
object using two different input methods. The
first is to input named character vectors of element sets. BiocSet
returns
three tibbles, es_element
which contains the elements, es_set
which contains
the sets and es_elementset
which contains elements and sets together.
tbl <- BiocSet(set1 = letters, set2 = LETTERS)
tbl
#> class: BiocSet
#>
#> es_element():
#> # A tibble: 52 × 1
#> element
#> <chr>
#> 1 a
#> 2 b
#> 3 c
#> # … with 49 more rows
#>
#> es_set():
#> # A tibble: 2 × 1
#> set
#> <chr>
#> 1 set1
#> 2 set2
#>
#> es_elementset() <active>:
#> # A tibble: 52 × 2
#> element set
#> <chr> <chr>
#> 1 a set1
#> 2 b set1
#> 3 c set1
#> # … with 49 more rows
The second method of creating a BiocSet
object would be to read in a .gmt
file. Using import()
, a path to a downloaded .gmt
file is read in and a
BiocSet
object is returned. The example below uses a hallmark element set
downloaded from GSEA, which is also included with this package. This
BiocSet
includes a source
column within the es_elementset
tibble for
reference as to where the element set came from.
gmtFile <- system.file(package = "BiocSet",
"extdata",
"hallmark.gene.symbol.gmt")
tbl2 <- import(gmtFile)
tbl2
#> class: BiocSet
#>
#> es_element():
#> # A tibble: 4,386 × 1
#> element
#> <chr>
#> 1 JUNB
#> 2 CXCL2
#> 3 ATF3
#> # … with 4,383 more rows
#>
#> es_set():
#> # A tibble: 50 × 2
#> set source
#> <chr> <chr>
#> 1 HALLMARK_TNFA_SIGNALING_VIA_NFKB http://www.broadinstitute.org/gsea/msigdb/ca…
#> 2 HALLMARK_HYPOXIA http://www.broadinstitute.org/gsea/msigdb/ca…
#> 3 HALLMARK_CHOLESTEROL_HOMEOSTASIS http://www.broadinstitute.org/gsea/msigdb/ca…
#> # … with 47 more rows
#>
#> es_elementset() <active>:
#> # A tibble: 7,324 × 2
#> element set
#> <chr> <chr>
#> 1 JUNB HALLMARK_TNFA_SIGNALING_VIA_NFKB
#> 2 CXCL2 HALLMARK_TNFA_SIGNALING_VIA_NFKB
#> 3 ATF3 HALLMARK_TNFA_SIGNALING_VIA_NFKB
#> # … with 7,321 more rows
export()
allows for a BiocSet
object to be exported into a temporary file
with the extention .gmt
.
fl <- tempfile(fileext = ".gmt")
gmt <- export(tbl2, fl)
gmt
#> GMTFile object
#> resource: /tmp/Rtmp4yREKW/file38cff31d92786.gmt
We also support importing files to create an OBOSet
object which extends
the BiocSet
class. The default only displays the most basic amount of columns
needed to perform certain tasks, but the argument extract_tag = everything
allows for additional columns to be imported as well. The following workflow
will demonstrate the use of both export and import to create a smaller subset
of the large obo file that is accessable from GeneOntology.
download.file("http://current.geneontology.org/ontology/go.obo", "obo_file.obo")
foo <- import("obo_file.obo", extract_tag = "everything")
small_tst <- es_element(foo)[1,] %>%
unnest("ancestors") %>%
select("element", "ancestors") %>%
unlist() %>%
unique()
small_oboset <- foo %>% filter_elementset(element %in% small_tst)
fl <- tempfile(fileext = ".obo")
export(small_oboset, fl)
Then you can import this smaller obo file which we utilize here, in tests, and example code.
oboFile <- system.file(package = "BiocSet",
"extdata",
"sample_go.obo")
obo <- import(oboFile)
obo
#> class: OBOSet
#>
#> es_element():
#> # A tibble: 8 × 3
#> element name obsolete
#> <chr> <chr> <lgl>
#> 1 GO:0033955 mitochondrial DNA inheritance FALSE
#> 2 GO:0000002 mitochondrial genome maintenance FALSE
#> 3 GO:0007005 mitochondrion organization FALSE
#> # … with 5 more rows
#>
#> es_set():
#> # A tibble: 8 × 2
#> set name
#> <chr> <chr>
#> 1 GO:0000002 mitochondrial genome maintenance
#> 2 GO:0006996 organelle organization
#> 3 GO:0007005 mitochondrion organization
#> # … with 5 more rows
#>
#> es_elementset() <active>:
#> # A tibble: 36 × 3
#> element set is_a
#> <chr> <chr> <lgl>
#> 1 GO:0033955 GO:0000002 TRUE
#> 2 GO:0000002 GO:0000002 FALSE
#> 3 GO:0007005 GO:0006996 TRUE
#> # … with 33 more rows
A feature available to BiocSet
is the ability to activate different
tibbles to perform certain functions on. When a BiocSet
is created, the tibble
es_elementset
is automatically activated and all functions will be performed
on this tibble. BiocSet
adopts the use of many common dplyr
functions such
as filter()
, select()
, mutate()
, summarise()
, and arrange()
. With
each of the functions the user is able to pick a different tibble to activate
and work on by using ‘verb_tibble’. After the function is executed than the
‘active’ tibble is returned back to the tibble that was active before the
function call. Some examples are shown below of how these functions work.
tbl <- BiocSet(set1 = letters, set2 = LETTERS)
tbl
#> class: BiocSet
#>
#> es_element():
#> # A tibble: 52 × 1
#> element
#> <chr>
#> 1 a
#> 2 b
#> 3 c
#> # … with 49 more rows
#>
#> es_set():
#> # A tibble: 2 × 1
#> set
#> <chr>
#> 1 set1
#> 2 set2
#>
#> es_elementset() <active>:
#> # A tibble: 52 × 2
#> element set
#> <chr> <chr>
#> 1 a set1
#> 2 b set1
#> 3 c set1
#> # … with 49 more rows
tbl %>% filter_element(element == "a" | element == "A")
#> class: BiocSet
#>
#> es_element():
#> # A tibble: 2 × 1
#> element
#> <chr>
#> 1 a
#> 2 A
#>
#> es_set():
#> # A tibble: 2 × 1
#> set
#> <chr>
#> 1 set1
#> 2 set2
#>
#> es_elementset() <active>:
#> # A tibble: 2 × 2
#> element set
#> <chr> <chr>
#> 1 a set1
#> 2 A set2
tbl %>% mutate_set(pval = rnorm(1:2))
#> class: BiocSet
#>
#> es_element():
#> # A tibble: 52 × 1
#> element
#> <chr>
#> 1 a
#> 2 b
#> 3 c
#> # … with 49 more rows
#>
#> es_set():
#> # A tibble: 2 × 2
#> set pval
#> <chr> <dbl>
#> 1 set1 0.207
#> 2 set2 0.600
#>
#> es_elementset() <active>:
#> # A tibble: 52 × 2
#> element set
#> <chr> <chr>
#> 1 a set1
#> 2 b set1
#> 3 c set1
#> # … with 49 more rows
tbl %>% arrange_elementset(desc(element))
#> class: BiocSet
#>
#> es_element():
#> # A tibble: 52 × 1
#> element
#> <chr>
#> 1 a
#> 2 b
#> 3 c
#> # … with 49 more rows
#>
#> es_set():
#> # A tibble: 2 × 1
#> set
#> <chr>
#> 1 set1
#> 2 set2
#>
#> es_elementset() <active>:
#> # A tibble: 52 × 2
#> element set
#> <chr> <chr>
#> 1 z set1
#> 2 y set1
#> 3 x set1
#> # … with 49 more rows
BiocSet
also allows for common set operations to be performed on the BiocSet
object. union()
and intersection()
are the two set operations available in
BiocSet
. We demonstate how a user can find the union between two BiocSet
objects or within a single BiocSet
object. Intersection is used in the same
way.
# union of two BiocSet objects
es1 <- BiocSet(set1 = letters[c(1:3)], set2 = LETTERS[c(1:3)])
es2 <- BiocSet(set1 = letters[c(2:4)], set2 = LETTERS[c(2:4)])
union(es1, es2)
#> class: BiocSet
#>
#> es_element():
#> # A tibble: 8 × 1
#> element
#> <chr>
#> 1 a
#> 2 b
#> 3 c
#> # … with 5 more rows
#>
#> es_set():
#> # A tibble: 2 × 1
#> set
#> <chr>
#> 1 set1
#> 2 set2
#>
#> es_elementset() <active>:
#> # A tibble: 8 × 2
#> element set
#> <chr> <chr>
#> 1 a set1
#> 2 b set1
#> 3 c set1
#> # … with 5 more rows
# union within a single BiocSet object
es3 <- BiocSet(set1 = letters[c(1:10)], set2 = letters[c(4:20)])
union_single(es3)
#> class: BiocSet
#>
#> es_element():
#> # A tibble: 20 × 1
#> element
#> <chr>
#> 1 a
#> 2 b
#> 3 c
#> # … with 17 more rows
#>
#> es_set():
#> # A tibble: 1 × 1
#> set
#> <chr>
#> 1 union
#>
#> es_elementset() <active>:
#> # A tibble: 20 × 2
#> element set
#> <chr> <chr>
#> 1 a union
#> 2 b union
#> 3 c union
#> # … with 17 more rows
Next, we demonstrate the a couple uses of BiocSet
with an experiment dataset
airway
from the package airway
. This data is from an RNA-Seq experiment on
airway smooth muscle (ASM) cell lines.
The first step is to load the library and the necessary data.
library(airway)
data("airway")
se <- airway
The function go_sets()
discovers the keys from the org object and uses
AnnotationDbi::select
to create a mapping to GO ids. go_sets()
also allows
the user to indicate which evidence type or ontology type they would like when
selecting the GO ids. The default is using all evidence types and all ontology
types. We represent these identifieres as a BiocSet
object. Using the
go_sets
function we are able to map the Ensembl ids and GO ids from the genome
wide annotation for Human data in the org.Hs.eg.db
package. The Ensembl ids
are treated as elements while the GO ids are treated as sets.
library(org.Hs.eg.db)
go <- go_sets(org.Hs.eg.db, "ENSEMBL")
go
#> class: BiocSet
#>
#> es_element():
#> # A tibble: 23,067 × 1
#> element
#> <chr>
#> 1 ENSG00000151729
#> 2 ENSG00000025708
#> 3 ENSG00000068305
#> # … with 23,064 more rows
#>
#> es_set():
#> # A tibble: 18,593 × 1
#> set
#> <chr>
#> 1 GO:0000002
#> 2 GO:0000003
#> 3 GO:0000009
#> # … with 18,590 more rows
#>
#> es_elementset() <active>:
#> # A tibble: 322,605 × 2
#> element set
#> <chr> <chr>
#> 1 ENSG00000151729 GO:0000002
#> 2 ENSG00000025708 GO:0000002
#> 3 ENSG00000068305 GO:0000002
#> # … with 322,602 more rows
# an example of subsetting by evidence type
go_sets(org.Hs.eg.db, "ENSEMBL", evidence = c("IPI", "TAS"))
#> class: BiocSet
#>
#> es_element():
#> # A tibble: 17,796 × 1
#> element
#> <chr>
#> 1 ENSG00000151729
#> 2 ENSG00000168685
#> 3 ENSG00000114030
#> # … with 17,793 more rows
#>
#> es_set():
#> # A tibble: 4,306 × 2
#> set evidence
#> <chr> <named list>
#> 1 GO:0000002 <chr [1]>
#> 2 GO:0000018 <chr [1]>
#> 3 GO:0000019 <chr [1]>
#> # … with 4,303 more rows
#>
#> es_elementset() <active>:
#> # A tibble: 56,928 × 2
#> element set
#> <chr> <chr>
#> 1 ENSG00000151729 GO:0000002
#> 2 ENSG00000168685 GO:0000018
#> 3 ENSG00000114030 GO:0000018
#> # … with 56,925 more rows
Some users may not be interested in reporting the non-descriptive elements. We
demonstrate subsetting the airway
data to include non-zero assays and then
filter out the non-descriptive elements.
se1 = se[rowSums(assay(se)) != 0,]
go %>% filter_element(element %in% rownames(se1))
#> class: BiocSet
#>
#> es_element():
#> # A tibble: 16,633 × 1
#> element
#> <chr>
#> 1 ENSG00000151729
#> 2 ENSG00000025708
#> 3 ENSG00000068305
#> # … with 16,630 more rows
#>
#> es_set():
#> # A tibble: 18,593 × 1
#> set
#> <chr>
#> 1 GO:0000002
#> 2 GO:0000003
#> 3 GO:0000009
#> # … with 18,590 more rows
#>
#> es_elementset() <active>:
#> # A tibble: 252,331 × 2
#> element set
#> <chr> <chr>
#> 1 ENSG00000151729 GO:0000002
#> 2 ENSG00000025708 GO:0000002
#> 3 ENSG00000068305 GO:0000002
#> # … with 252,328 more rows
It may also be of interest to users to know how many elements are in each set.
Using the count
function we are able to calculate the elements per set.
go %>% group_by(set) %>% dplyr::count()
#> Warning: The `add` argument of `group_by()` is deprecated as of dplyr 1.0.0.
#> Please use the `.add` argument instead.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
#> # A tibble: 18,593 × 2
#> # Groups: set [18,593]
#> set n
#> <chr> <int>
#> 1 GO:0000002 12
#> 2 GO:0000003 4
#> 3 GO:0000009 2
#> 4 GO:0000010 2
#> 5 GO:0000012 10
#> 6 GO:0000014 9
#> 7 GO:0000015 4
#> 8 GO:0000016 1
#> 9 GO:0000017 2
#> 10 GO:0000018 6
#> # … with 18,583 more rows
It may also be helpful to remove sets that are empty. Since we have shown how to calculate the number of elements per set, we know that this data set does not contain any empty sets. We decide to demonstrate regardless for those users that may need this functionality.
drop <- es_activate(go, elementset) %>% group_by(set) %>%
dplyr::count() %>% filter(n == 0) %>% pull(set)
go %>% filter_set(!(set %in% drop))
#> class: BiocSet
#>
#> es_element():
#> # A tibble: 23,067 × 1
#> element
#> <chr>
#> 1 ENSG00000151729
#> 2 ENSG00000025708
#> 3 ENSG00000068305
#> # … with 23,064 more rows
#>
#> es_set():
#> # A tibble: 18,593 × 1
#> set
#> <chr>
#> 1 GO:0000002
#> 2 GO:0000003
#> 3 GO:0000009
#> # … with 18,590 more rows
#>
#> es_elementset() <active>:
#> # A tibble: 322,605 × 2
#> element set
#> <chr> <chr>
#> 1 ENSG00000151729 GO:0000002
#> 2 ENSG00000025708 GO:0000002
#> 3 ENSG00000068305 GO:0000002
#> # … with 322,602 more rows
To simplify mapping we created a couple map
functions. map_unique()
is used
when there is known 1:1 mapping, This takes four arguements, a BiocSet
object, an AnnotationDbi
object, the id to map from, and the id to map to.
map_multiple()
needs a fifth argument to indicate how the function should
treat an element when there is multiple mapping. Both functions utilize
mapIds
from AnnotationDbi
and return a BiocSet
object. In the example
below we show how to use map_unique
to map go
’s ids from Ensembl to gene
symbols.
go %>% map_unique(org.Hs.eg.db, "ENSEMBL", "SYMBOL")
#> 'select()' returned 1:many mapping between keys and columns
#> Joining, by = "element"
#> `mutate_if()` ignored the following grouping variables:
#> Column `element`
#> Joining, by = "element"
#> class: BiocSet
#>
#> es_element():
#> # A tibble: 20,316 × 1
#> element
#> <chr>
#> 1 AKT3
#> 2 LONP1
#> 3 MEF2A
#> # … with 20,313 more rows
#>
#> es_set():
#> # A tibble: 18,593 × 1
#> set
#> <chr>
#> 1 GO:0000002
#> 2 GO:0000003
#> 3 GO:0000009
#> # … with 18,590 more rows
#>
#> es_elementset() <active>:
#> # A tibble: 285,761 × 2
#> element set
#> <chr> <chr>
#> 1 AKT3 GO:0000002
#> 2 LONP1 GO:0000002
#> 3 MEF2A GO:0000002
#> # … with 285,758 more rows
Another functionality of BiocSet
is the ability to add information to the
tibbles. Using the GO.db
library we are able to map definitions to the GO
ids. From there we can add the mapping to the tibble using map_add()
and the
mutate function.
library(GO.db)
map <- map_add_set(go, GO.db, "GOID", "DEFINITION")
go %>% mutate_set(definition = map)
#> class: BiocSet
#>
#> es_element():
#> # A tibble: 23,067 × 1
#> element
#> <chr>
#> 1 ENSG00000151729
#> 2 ENSG00000025708
#> 3 ENSG00000068305
#> # … with 23,064 more rows
#>
#> es_set():
#> # A tibble: 18,593 × 2
#> set definition
#> <chr> <chr>
#> 1 GO:0000002 The maintenance of the structure and integrity of the mitochondria…
#> 2 GO:0000003 The production of new individuals that contain some portion of gen…
#> 3 GO:0000009 Catalysis of the transfer of a mannose residue to an oligosacchari…
#> # … with 18,590 more rows
#>
#> es_elementset() <active>:
#> # A tibble: 322,605 × 2
#> element set
#> <chr> <chr>
#> 1 ENSG00000151729 GO:0000002
#> 2 ENSG00000025708 GO:0000002
#> 3 ENSG00000068305 GO:0000002
#> # … with 322,602 more rows
The library KEGGREST
is a client interface to the KEGG REST server.
KEGG contains pathway maps that represent interaction, reaction and relation
networks for various biological processes and diseases. BiocSet
has a
function that utilizes KEGGREST
to develop a BiocSet
object that contains
the elements for every pathway map in KEGG.
Due to limiations of the KEGGREST package, kegg_sets
can take some time to
run depending on the amount of pathways for the species of interest. Therefore
we demonstrate using BiocFileCache
to make the data available to the user.
library(BiocFileCache)
rname <- "kegg_hsa"
exists <- NROW(bfcquery(query=rname, field="rname")) != 0L
if (!exists)
{
kegg <- kegg_sets("hsa")
fl <- bfcnew(rname = rname, ext = ".gmt")
export(kegg_sets("hsa"), fl)
}
kegg <- import(bfcrpath(rname=rname))
Within the kegg_sets()
function we remove pathways that do not contain any
elements. We then mutate the element tibble using the map_add
function to
contain both Ensembl and Entrez ids.
map <- map_add_element(kegg, org.Hs.eg.db, "ENTREZID", "ENSEMBL")
#> 'select()' returned 1:many mapping between keys and columns
kegg <- kegg %>% mutate_element(ensembl = map)
Since we are working with ASM data we thought we would subset the airway
data
to contain only the elements in the asthma pathway. This filter is performed
on the KEGG id, which for asthma is “hsa05310”.
asthma <- kegg %>% filter_set(set == "hsa05310")
se <- se[rownames(se) %in% es_element(asthma)$ensembl,]
se
#> class: RangedSummarizedExperiment
#> dim: 7843 8
#> metadata(1): ''
#> assays(1): counts
#> rownames(7843): ENSG00000000419 ENSG00000000938 ... ENSG00000273079
#> ENSG00000273085
#> rowData names(0):
#> colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
#> colData names(9): SampleName cell ... Sample BioSample
The filtering can also be done for multiple pathways.
pathways <- c("hsa05310", "hsa04110", "hsa05224", "hsa04970")
multipaths <- kegg %>% filter_set(set %in% pathways)
multipaths
#> class: BiocSet
#>
#> es_element():
#> # A tibble: 8,092 × 2
#> element ensembl
#> <chr> <chr>
#> 1 3101 ENSG00000160883
#> 2 3098 ENSG00000156515
#> 3 3099 ENSG00000159399
#> # … with 8,089 more rows
#>
#> es_set():
#> # A tibble: 4 × 2
#> set source
#> <chr> <chr>
#> 1 hsa04110 <NA>
#> 2 hsa04970 <NA>
#> 3 hsa05224 <NA>
#> # … with 1 more row
#>
#> es_elementset() <active>:
#> # A tibble: 395 × 2
#> element set
#> <chr> <chr>
#> 1 595 hsa04110
#> 2 894 hsa04110
#> 3 896 hsa04110
#> # … with 392 more rows
This example will start out the same way that Case Study I started, by loading
in the airway
dataset, but we will also do some reformating of the data. The
end goal is to be able to perform a Gene Set Enrichment Analysis on the data and
return a BiocSet object of the gene sets.
data("airway")
airway$dex <- relevel(airway$dex, "untrt")
Similar to other analyses we perform a differential expression analysis on the
airway
data using the library DESeq2
and then store the results into a
tibble.
library(DESeq2)
library(tibble)
des <- DESeqDataSet(airway, design = ~ cell + dex)
des <- DESeq(des)
#> estimating size factors
#> estimating dispersions
#> gene-wise dispersion estimates
#> mean-dispersion relationship
#> final dispersion estimates
#> fitting model and testing
res <- results(des)
tbl <- res %>%
as.data.frame() %>%
as_tibble(rownames = "ENSEMBL")
Since we want to use limma::goana()
to perform the GSEA we will need to have
ENTREZ identifiers in the data, as well as filter out some NA
information.
Later on this will be considered our ‘element’ tibble.
tbl <- tbl %>%
mutate(
ENTREZID = mapIds(
org.Hs.eg.db, ENSEMBL, "ENTREZID", "ENSEMBL"
) %>% unname()
)
#> 'select()' returned 1:many mapping between keys and columns
tbl <- tbl %>% filter(!is.na(padj), !is.na(ENTREZID))
tbl
#> # A tibble: 15,115 × 8
#> ENSEMBL baseMean log2FoldChange lfcSE stat pvalue padj ENTREZID
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 ENSG0000000… 709. -0.381 0.101 -3.79 1.52e-4 1.28e-3 7105
#> 2 ENSG0000000… 520. 0.207 0.112 1.84 6.53e-2 1.97e-1 8813
#> 3 ENSG0000000… 237. 0.0379 0.143 0.264 7.92e-1 9.11e-1 57147
#> 4 ENSG0000000… 57.9 -0.0882 0.287 -0.307 7.59e-1 8.95e-1 55732
#> 5 ENSG0000000… 5817. 0.426 0.0883 4.83 1.38e-6 1.82e-5 3075
#> 6 ENSG0000000… 1282. -0.241 0.0887 -2.72 6.58e-3 3.28e-2 2519
#> 7 ENSG0000000… 610. -0.0476 0.167 -0.286 7.75e-1 9.03e-1 2729
#> 8 ENSG0000000… 369. -0.500 0.121 -4.14 3.48e-5 3.42e-4 4800
#> 9 ENSG0000000… 183. -0.124 0.180 -0.689 4.91e-1 7.24e-1 90529
#> 10 ENSG0000000… 2814. -0.0411 0.103 -0.400 6.89e-1 8.57e-1 57185
#> # … with 15,105 more rows
Now that the data is ready for GSEA we can go ahead and use goana()
and make
the results into a tibble. This tibble will be considered our ‘set’ tibble.
library(limma)
#>
#> Attaching package: 'limma'
#> The following object is masked from 'package:DESeq2':
#>
#> plotMA
#> The following object is masked from 'package:BiocGenerics':
#>
#> plotMA
go_ids <- goana(tbl$ENTREZID[tbl$padj < 0.05], tbl$ENTREZID, "Hs") %>%
as.data.frame() %>%
as_tibble(rownames = "GOALL")
go_ids
#> # A tibble: 21,282 × 6
#> GOALL Term Ont N DE P.DE
#> <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 GO:0008150 biological_process BP 12328 3396 2.33e-58
#> 2 GO:0001867 complement activation, lectin pathway BP 5 3 1.03e- 1
#> 3 GO:0001868 regulation of complement activation, l… BP 2 1 4.37e- 1
#> 4 GO:0001869 negative regulation of complement acti… BP 2 1 4.37e- 1
#> 5 GO:0002252 immune effector process BP 351 114 8.59e- 4
#> 6 GO:0002253 activation of immune response BP 177 62 1.75e- 3
#> 7 GO:0002376 immune system process BP 1627 555 1.64e-18
#> 8 GO:0002682 regulation of immune system process BP 888 320 2.94e-14
#> 9 GO:0002683 negative regulation of immune system p… BP 280 106 1.05e- 6
#> 10 GO:0002684 positive regulation of immune system p… BP 518 187 7.07e- 9
#> # … with 21,272 more rows
The last thing we need to do is create a tibble that we will consider our ‘elementset’ tibble. This tibble will be a mapping of all the elements and sets.
foo <- AnnotationDbi::select(
org.Hs.eg.db,
tbl$ENTREZID,
"GOALL",
"ENTREZID") %>% as_tibble()
#> 'select()' returned many:many mapping between keys and columns
foo <- foo %>% dplyr::select(-EVIDENCEALL) %>% distinct()
foo <- foo %>% filter(ONTOLOGYALL == "BP") %>% dplyr::select(-ONTOLOGYALL)
foo
#> # A tibble: 1,055,680 × 2
#> ENTREZID GOALL
#> <chr> <chr>
#> 1 7105 GO:0002221
#> 2 7105 GO:0002376
#> 3 7105 GO:0002682
#> 4 7105 GO:0002683
#> 5 7105 GO:0002753
#> 6 7105 GO:0002764
#> 7 7105 GO:0002831
#> 8 7105 GO:0002832
#> 9 7105 GO:0006950
#> 10 7105 GO:0006952
#> # … with 1,055,670 more rows
The function BiocSet_from_elementset()
allows for users to create a BiocSet
object from tibbles. This function is helpful when there is metadata contained
in the tibble that should be in the BiocSet object. For this function to work
properly, the columns that are being joined on need to be named correctly. For
instance, in order to use this function on the tibbles we created we need to
change the column in the ‘element’ tibble to ‘element’, the column in the ‘set’
tibble to ‘set’ and the same will be for the ‘elementset’ tibble. We demonstrate
this below and then create the BiocSet object with the simple function.
foo <- foo %>% dplyr::rename(element = ENTREZID, set = GOALL)
tbl <- tbl %>% dplyr::rename(element = ENTREZID)
go_ids <- go_ids %>% dplyr::rename(set = GOALL)
es <- BiocSet_from_elementset(foo, tbl, go_ids)
#> more elements in 'element' than in 'elementset'
#> more elements in 'set' than in 'elementset'
es
#> class: BiocSet
#>
#> es_element():
#> # A tibble: 12,336 × 8
#> element ENSEMBL baseMean log2FoldChange lfcSE stat pvalue padj
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 3980 ENSG00000005156 206. -0.393 0.172 -2.29 2.23e- 2 8.66e-2
#> 2 1890 ENSG00000025708 145. 1.23 0.199 6.18 6.58e-10 1.49e-8
#> 3 4205 ENSG00000068305 1566. 0.529 0.0919 5.76 8.45e- 9 1.64e-7
#> # … with 12,333 more rows
#>
#> es_set():
#> # A tibble: 14,912 × 6
#> set Term Ont N DE P.DE
#> <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 GO:0000002 mitochondrial genome maintenance BP 20 6 0.383
#> 2 GO:0000003 reproduction BP 986 307 0.00000386
#> 3 GO:0000012 single strand break repair BP 11 1 0.958
#> # … with 14,909 more rows
#>
#> es_elementset() <active>:
#> # A tibble: 1,055,680 × 2
#> element set
#> <chr> <chr>
#> 1 3980 GO:0000002
#> 2 1890 GO:0000002
#> 3 4205 GO:0000002
#> # … with 1,055,677 more rows
For those users that may need to put this metadata filled BiocSet object back into an object similar to GRanges or SummarizedExperiment, we have created functions that allow for the BiocSet object to be created into a tibble or data.frame.
tibble_from_element(es)
#> Joining, by = "set"
#> Joining, by = "element"
#> # A tibble: 12,328 × 14
#> element set Term Ont N DE P.DE ENSEMBL baseMean log2FoldChange
#> <chr> <lis> <lis> <lis> <lis> <lis> <lis> <list> <list> <list>
#> 1 1 <chr… <chr… <chr… <dbl… <dbl… <dbl… <chr [… <dbl [1… <dbl [1]>
#> 2 100 <chr… <chr… <chr… <dbl… <dbl… <dbl… <chr [… <dbl [4… <dbl [489]>
#> 3 1000 <chr… <chr… <chr… <dbl… <dbl… <dbl… <chr [… <dbl [2… <dbl [222]>
#> 4 10000 <chr… <chr… <chr… <dbl… <dbl… <dbl… <chr [… <dbl [1… <dbl [156]>
#> 5 10001 <chr… <chr… <chr… <dbl… <dbl… <dbl… <chr [… <dbl [7… <dbl [78]>
#> 6 10003 <chr… <chr… <chr… <dbl… <dbl… <dbl… <chr [… <dbl [9… <dbl [9]>
#> 7 10004 <chr… <chr… <chr… <dbl… <dbl… <dbl… <chr [… <dbl [1… <dbl [19]>
#> 8 100048912 <chr… <chr… <chr… <dbl… <dbl… <dbl… <chr [… <dbl [3… <dbl [36]>
#> 9 10005 <chr… <chr… <chr… <dbl… <dbl… <dbl… <chr [… <dbl [1… <dbl [122]>
#> 10 10006 <chr… <chr… <chr… <dbl… <dbl… <dbl… <chr [… <dbl [1… <dbl [139]>
#> # … with 12,318 more rows, and 4 more variables: lfcSE <list>, stat <list>,
#> # pvalue <list>, padj <list>
head(data.frame_from_elementset(es))
#> Joining, by = "set"
#> Joining, by = "element"
#> element set Term Ont N DE P.DE
#> 1 3980 GO:0000002 mitochondrial genome maintenance BP 20 6 0.3825257
#> 2 1890 GO:0000002 mitochondrial genome maintenance BP 20 6 0.3825257
#> 3 4205 GO:0000002 mitochondrial genome maintenance BP 20 6 0.3825257
#> 4 10891 GO:0000002 mitochondrial genome maintenance BP 20 6 0.3825257
#> 5 55186 GO:0000002 mitochondrial genome maintenance BP 20 6 0.3825257
#> 6 4358 GO:0000002 mitochondrial genome maintenance BP 20 6 0.3825257
#> ENSEMBL baseMean log2FoldChange lfcSE stat pvalue
#> 1 ENSG00000005156 205.6627 -0.39272067 0.17180717 -2.285822 2.226466e-02
#> 2 ENSG00000025708 144.6965 1.23104228 0.19933257 6.175821 6.582046e-10
#> 3 ENSG00000068305 1565.5675 0.52905471 0.09186326 5.759155 8.453618e-09
#> 4 ENSG00000109819 152.4021 -0.23932270 0.17393025 -1.375969 1.688311e-01
#> 5 ENSG00000114120 2065.4720 0.14500117 0.09361789 1.548862 1.214150e-01
#> 6 ENSG00000115204 1091.6398 -0.09563418 0.09191349 -1.040480 2.981168e-01
#> padj
#> 1 8.661787e-02
#> 2 1.494473e-08
#> 3 1.636969e-07
#> 4 3.845953e-01
#> 5 3.059216e-01
#> 6 5.485252e-01
A final feature to BiocSet
is the ability to add reference information about
all of the elements/sets. A user could utilize the function url_ref()
to add
information to the BiocSet
object. If a user has a question about a specific
id then they can follow the reference url to get more informtion. Below is an
example of using url_ref()
to add reference urls to the go
data set we
worked with above.
url_ref(go)
#> class: BiocSet
#>
#> es_element():
#> # A tibble: 23,067 × 2
#> element url
#> <chr> <chr>
#> 1 ENSG00000151729 https://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=E…
#> 2 ENSG00000025708 https://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=E…
#> 3 ENSG00000068305 https://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=E…
#> # … with 23,064 more rows
#>
#> es_set():
#> # A tibble: 18,593 × 2
#> set url
#> <chr> <chr>
#> 1 GO:0000002 http://amigo.geneontology.org/amigo/medial_search?q=GO:0000002
#> 2 GO:0000003 http://amigo.geneontology.org/amigo/medial_search?q=GO:0000003
#> 3 GO:0000009 http://amigo.geneontology.org/amigo/medial_search?q=GO:0000009
#> # … with 18,590 more rows
#>
#> es_elementset() <active>:
#> # A tibble: 322,605 × 2
#> element set
#> <chr> <chr>
#> 1 ENSG00000151729 GO:0000002
#> 2 ENSG00000025708 GO:0000002
#> 3 ENSG00000068305 GO:0000002
#> # … with 322,602 more rows
sessionInfo()
#> R version 4.1.1 (2021-08-10)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] limma_3.50.0 tibble_3.1.5
#> [3] DESeq2_1.34.0 BiocFileCache_2.2.0
#> [5] dbplyr_2.1.1 GO.db_3.14.0
#> [7] org.Hs.eg.db_3.14.0 AnnotationDbi_1.56.1
#> [9] airway_1.14.0 SummarizedExperiment_1.24.0
#> [11] Biobase_2.54.0 GenomicRanges_1.46.0
#> [13] GenomeInfoDb_1.30.0 IRanges_2.28.0
#> [15] S4Vectors_0.32.0 BiocGenerics_0.40.0
#> [17] MatrixGenerics_1.6.0 matrixStats_0.61.0
#> [19] BiocSet_1.8.1 dplyr_1.0.7
#> [21] BiocStyle_2.22.0
#>
#> loaded via a namespace (and not attached):
#> [1] bitops_1.0-7 bit64_4.0.5 filelock_1.0.2
#> [4] RColorBrewer_1.1-2 httr_1.4.2 tools_4.1.1
#> [7] bslib_0.3.1 utf8_1.2.2 R6_2.5.1
#> [10] colorspace_2.0-2 DBI_1.1.1 withr_2.4.2
#> [13] tidyselect_1.1.1 ontologyIndex_2.7 bit_4.0.4
#> [16] curl_4.3.2 compiler_4.1.1 cli_3.1.0
#> [19] formatR_1.11 DelayedArray_0.20.0 bookdown_0.24
#> [22] sass_0.4.0 scales_1.1.1 genefilter_1.76.0
#> [25] rappdirs_0.3.3 stringr_1.4.0 digest_0.6.28
#> [28] rmarkdown_2.11 XVector_0.34.0 pkgconfig_2.0.3
#> [31] htmltools_0.5.2 fastmap_1.1.0 rlang_0.4.12
#> [34] RSQLite_2.2.8 jquerylib_0.1.4 BiocIO_1.4.0
#> [37] generics_0.1.1 jsonlite_1.7.2 BiocParallel_1.28.0
#> [40] RCurl_1.98-1.5 magrittr_2.0.1 GenomeInfoDbData_1.2.7
#> [43] Matrix_1.3-4 munsell_0.5.0 Rcpp_1.0.7
#> [46] fansi_0.5.0 lifecycle_1.0.1 stringi_1.7.5
#> [49] yaml_2.2.1 zlibbioc_1.40.0 plyr_1.8.6
#> [52] grid_4.1.1 blob_1.2.2 parallel_4.1.1
#> [55] crayon_1.4.2 lattice_0.20-45 Biostrings_2.62.0
#> [58] splines_4.1.1 annotate_1.72.0 KEGGREST_1.34.0
#> [61] locfit_1.5-9.4 knitr_1.36 pillar_1.6.4
#> [64] geneplotter_1.72.0 XML_3.99-0.8 glue_1.4.2
#> [67] evaluate_0.14 BiocManager_1.30.16 png_0.1-7
#> [70] vctrs_0.3.8 gtable_0.3.0 purrr_0.3.4
#> [73] tidyr_1.1.4 assertthat_0.2.1 cachem_1.0.6
#> [76] ggplot2_3.3.5 xfun_0.27 xtable_1.8-4
#> [79] survival_3.2-13 memoise_2.0.0 ellipsis_0.3.2