ensembldb 2.0.4
From Bioconductor release 3.5 on, EnsDb
databases/packages created by the ensembldb
package contain also, for transcripts with a coding regions, mappings between transcripts and proteins. Thus, in addition to the RNA/DNA-based features also the following protein related information is available:
protein_id
: the Ensembl protein ID. This is the primary ID for the proteins defined in Ensembl and each (protein coding) Ensembl transcript has one protein ID assigned to it.protein_sequence
: the amino acid sequence of a protein.uniprot_id
: the Uniprot ID for a protein. Note that not every Ensembl protein_id
has an Uniprot ID, and each protein_id
might be mapped to several uniprot_id
. Also, the same Uniprot ID might be mapped to different protein_id
.uniprot_db
: the name of the Uniprot database in which the feature is annotated. Can be either SPTREMBL or SWISSPROT.uniprot_mapping_type
: the type of the mapping method that was used to assign the Uniprot ID to the Ensembl protein ID.protein_domain_id
: the ID of the protein domain according to the source/analysis in/by which is was defined.protein_domain_source
: the source of the protein domain information, one of pfscan, scanprosite, superfamily, pfam, prints, smart, pirsf or tigrfam.interpro_accession
: the Interpro accession ID of the protein domain (if available).prot_dom_start
: the start of the protein domain within the sequence of the protein.prot_dom_start
: the end position of the protein domain within the sequence of the protein.Thus, for protein coding transcripts, these annotations can be fetched from the database too, given that protein annotations are available. Note that only EnsDb
databases created through the Ensembl Perl API contain protein annotation, while databases created using ensDbFromAH
, ensDbFromGff
, ensDbFromGRanges
and ensDbFromGtf
don’t.
library(ensembldb)
library(EnsDb.Hsapiens.v75)
edb <- EnsDb.Hsapiens.v75
## Evaluate whether we have protein annotation available
hasProteinData(edb)
If protein annotation is available, the additional tables and columns are also listed by the listTables
and listColumns
methods.
listTables(edb)
In the following sections we show examples how to 1) fetch protein annotations as additional columns to gene/transcript annotations, 2) fetch protein annotation data and 3) map proteins to the genome.
Protein annotations for (protein coding) transcripts can be retrieved by simply adding the desired annotation columns to the columns
parameter of the e.g. genes
or transcripts
methods.
## Get also protein information for ZBTB16 transcripts
txs <- transcripts(edb, filter = GenenameFilter("ZBTB16"),
columns = c("protein_id", "uniprot_id", "tx_biotype"))
txs
The gene ZBTB16 has protein coding and non-coding transcripts, thus, we get the protein ID for the coding- and NA
for the non-coding transcripts. Note also that we have a transcript targeted for nonsense mediated mRNA-decay with a protein ID associated with it, but no Uniprot ID.
## Subset to transcripts with tx_biotype other than protein_coding.
txs[txs$tx_biotype != "protein_coding", c("uniprot_id", "tx_biotype",
"protein_id")]
While the mapping from a protein coding transcript to a Ensembl protein ID (column protein_id
) is 1:1, the mapping between protein_id
and uniprot_id
can be n:m, i.e. each Ensembl protein ID can be mapped to 1 or more Uniprot IDs and each Uniprot ID can be mapped to more than one protein_id
(and hence tx_id
). This should be kept in mind if querying transcripts from the database fetching Uniprot related additional columns or even protein ID features, as in such cases a redundant list of transcripts is returned.
## List the protein IDs and uniprot IDs for the coding transcripts
mcols(txs[txs$tx_biotype == "protein_coding",
c("tx_id", "protein_id", "uniprot_id")])
Some of the n:m mappings for Uniprot IDs can be resolved by restricting either to entries from one Uniprot database (SPTREMBL or SWISSPROT) or to mappings of a certain type of mapping method. The corresponding filters are the UniprotDbFilter
and the UniprotMappingTypeFilter
(using the uniprot_db
and uniprot_mapping_type
columns of the uniprot
database table). In the example below we restrict the result to Uniprot IDs with the mapping type DIRECT.
## List all uniprot mapping types in the database.
listUniprotMappingTypes(edb)
## Get all protein_coding transcripts of ZBTB16 along with their protein_id
## and Uniprot IDs, restricting to protein_id to uniprot_id mappings based
## on "DIRECT" mapping methods.
txs <- transcripts(edb, filter = list(GenenameFilter("ZBTB16"),
UniprotMappingTypeFilter("DIRECT")),
columns = c("protein_id", "uniprot_id", "uniprot_db"))
mcols(txs)
For this example the use of the UniprotMappingTypeFilter
resolved the multiple mapping of Uniprot IDs to Ensembl protein IDs, but the Uniprot ID Q05516 is still assigned to the two Ensembl protein IDs ENSP00000338157 and ENSP00000376721.
All protein annotations can also be added as metadata columns to the results of the genes
, exons
, exonsBy
, transcriptsBy
, cdsBy
, fiveUTRsByTranscript
and threeUTRsByTranscript
methods by specifying the desired column names with the columns
parameter. For non coding transcripts NA
will be reported in the protein annotation columns.
In addition to retrieve protein annotations from the database, we can also use protein data to filter the results. In the example below we fetch for example all genes from the database that have a certain protein domain in the protein encoded by any of its transcripts.
## Get all genes that encode a transcript encoding for a protein that contains
## a certain protein domain.
gns <- genes(edb, filter = ProtDomIdFilter("PS50097"))
length(gns)
sort(gns$gene_name)
So, in total we got 152 genes with that protein domain. In addition to the ProtDomIdFilter
, also the ProteinidFilter
and the UniprotidFilter
can be used to query the database for entries matching conditions on their protein ID or Uniprot ID.
AnnotationDbi
package to query protein annotationThe select
, keys
and mapIds
methods from the AnnotationDbi
package can also be used to query EnsDb
objects for protein annotations. Supported columns and key types are returned by the columns
and keytypes
methods.
## Show all columns that are provided by the database
columns(edb)
## Show all key types/filters that are supported
keytypes(edb)
Below we fetch all Uniprot IDs annotated to the gene ZBTB16.
select(edb, keys = "ZBTB16", keytype = "GENENAME",
columns = "UNIPROTID")
This returns us all Uniprot IDs of all proteins encoded by the gene’s transcripts. One of the transcripts from ZBTB16, while having a CDS and being annotated to a protein, does not have an Uniprot ID assigned (thus NA
is returned by the above call). As we see below, this transcript is targeted for non sense mediated mRNA decay.
## Call select, this time providing a GenenameFilter.
select(edb, keys = GenenameFilter("ZBTB16"),
columns = c("TXBIOTYPE", "UNIPROTID", "PROTEINID"))
Note also that we passed this time a GenenameFilter
with the keys
parameter.
Proteins can be fetched using the dedicated proteins
method that returns, unlike DNA/RNA-based methods like genes
or transcripts
, not a GRanges
object by default, but a DataFrame
object. Alternatively, results can be returned as a data.frame
or as an AAStringSet
object from the Biobase
package. Note that this might change in future releases if a more appropriate object to represent protein annotations becomes available.
In the code chunk below we fetch all protein annotations for the gene ZBTB16.
## Get all proteins and return them as an AAStringSet
prts <- proteins(edb, filter = GenenameFilter("ZBTB16"),
return.type = "AAStringSet")
prts
Besides the amino acid sequence, the prts
contains also additional annotations that can be accessed with the mcols
method (metadata columns). All additional columns provided with the parameter columns
are also added to the mcols
DataFrame
.
mcols(prts)
Note that the proteins
method will retrieve only gene/transcript annotations of transcripts encoding a protein. Thus annotations for the non-coding transcripts of the gene ZBTB16, that were returned by calls to genes
or transcripts
in the previous section are not fetched.
Querying in addition Uniprot identifiers or protein domain data will result at present in a redundant list of proteins as shown in the code block below.
## Get also protein domain annotations in addition to the protein annotations.
pd <- proteins(edb, filter = GenenameFilter("ZBTB16"),
columns = c("tx_id", listColumns(edb, "protein_domain")),
return.type = "AAStringSet")
pd
The result contains one row/element for each protein domain in each of the proteins. The number of protein domains per protein and the mcols
are shown below.
## The number of protein domains per protein:
table(names(pd))
## The mcols
mcols(pd)
As we can see each protein can have several protein domains with the start and end coordinates within the amino acid sequence being reported in columns prot_dom_start
and prot_dom_end
. Also, not all Ensembl protein IDs, like protein_id
ENSP00000445047 are mapped to an Uniprot ID or have protein domains.
Functionality to map peptide features (i.e. ranges within the amino acid sequence of the protein) to genomic coordinates are provided by the Pbase
Bioconductor package. These rely in part on the protein annotations provided by EnsDb
databases. See the corresponding vignette Pbase-with-ensembldb in that package.