The treeio package is an infrastructure that enables evolutionary evidences that inferred by commonly used software packages in the field to be used in R. For instance, dN/dS values or ancestral sequences inferred by CODEML1, clade support values (posterior) inferred by BEAST2 and short read placement by EPA3 and pplacer4. These evolutionary evidences can be further analyzed in R and used to annotate phylogenetic tree using ggtree.

Supported File Formats

Most of the software in this field (including R packages) focus on Newick and Nexus file formats, while there are file formats from different evolution analysis software that contain supporting evidences within the file that are useful for further summary, analysis and annotation of the tree.

The treeio package define several parser functions and S4 classes to store statistical evidences inferred by commonly used software packages. It supports several file formats, including:

and software output from:

Parser functions

The treeio package implement several parser functions, including:

S4 classes

Correspondingly, ggtree defines several S4 classes to store evolutionary evidences inferred by these software packages, including:

The jplace class is also designed to store user specified annotation data.

In treeio, tree objects can be merged and evidences inferred from different phylogenetic analyses can be combined or compared and visualized.

For each class, we defined get.fields method to get the annotation features that available in the object that can be used to annotate a phylogenetic tree directly in ggtree. A get.tree method can be used to convert tree object to phylo (or multiPhylo for r8s) object that are widely supported by other R packages.

Detail descriptions of slots defined in each class are documented in class man pages. Users can use class?className (e.g. class?beast) to access man page of a class.

Getting Tree Data into R

Parsing BEAST output

file <- system.file("extdata/BEAST", "beast_mcc.tree", package="treeio")
beast <- read.beast(file)
beast
## 'beast' S4 object that stored information of
##   '/tmp/RtmpQnlgEk/Rinst195ebb11106/treeio/extdata/BEAST/beast_mcc.tree'.
## 
## ...@ tree: 
## Phylogenetic tree with 15 tips and 14 internal nodes.
## 
## Tip labels:
##  A_1995, B_1996, C_1995, D_1987, E_1996, F_1997, ...
## 
## Rooted; includes branch lengths.
## 
## with the following features available:
##  'height',   'height_0.95_HPD',  'height_median',    'height_range', 'length',
##  'length_0.95_HPD',  'length_median',    'length_range', 'posterior',    'rate',
##  'rate_0.95_HPD',    'rate_median',  'rate_range'.

Since % is not a valid character in names, all the feature names that contain x% will convert to 0.x. For example, length_95%_HPD will be changed to length_0.95_HPD.

The get.fields method return all available features that can be used for annotation.

get.fields(beast)
##  [1] "height"          "height_0.95_HPD" "height_median"  
##  [4] "height_range"    "length"          "length_0.95_HPD"
##  [7] "length_median"   "length_range"    "posterior"      
## [10] "rate"            "rate_0.95_HPD"   "rate_median"    
## [13] "rate_range"

Parsing PAML output

rst file

The read.paml_rst function can parse rst file from BASEML and CODEML. The only difference is the space in the sequences. For BASEML, each ten bases are separated by one space, while for CODEML, each three bases (triplet) are separated by one space.

brstfile <- system.file("extdata/PAML_Baseml", "rst", package="treeio")
brst <- read.paml_rst(brstfile)
brst
## 'paml_rst' S4 object that stored information of
##   '/tmp/RtmpQnlgEk/Rinst195ebb11106/treeio/extdata/PAML_Baseml/rst'.
## 
## ...@ tree:
## Phylogenetic tree with 15 tips and 13 internal nodes.
## 
## Tip labels:
##  A, B, C, D, E, F, ...
## Node labels:
##  16, 17, 18, 19, 20, 21, ...
## 
## Unrooted; includes branch lengths.
## 
## with the following features available:
##  'marginal_subs',    'joint_subs',   'marginal_AA_subs', 'joint_AA_subs'.

Similarly, we can parse the rst file from CODEML and visualize the tree with amino acid substitution.

crstfile <- system.file("extdata/PAML_Codeml", "rst", package="treeio")
crst <- read.paml_rst(crstfile)
crst
## 'paml_rst' S4 object that stored information of
##   '/tmp/RtmpQnlgEk/Rinst195ebb11106/treeio/extdata/PAML_Codeml/rst'.
## 
## ...@ tree:
## Phylogenetic tree with 15 tips and 13 internal nodes.
## 
## Tip labels:
##  A, B, C, D, E, F, ...
## Node labels:
##  16, 17, 18, 19, 20, 21, ...
## 
## Unrooted; includes branch lengths.
## 
## with the following features available:
##  'marginal_subs',    'joint_subs',   'marginal_AA_subs', 'joint_AA_subs'.

We can find that these two figures have different evolution distances and subsitutions inferred by BASEML and CODEML are slightly different.

CODEML

mlc file

The mlc file output by CODEML contains dN/dS estimation.

mlcfile <- system.file("extdata/PAML_Codeml", "mlc", package="treeio")
mlc <- read.codeml_mlc(mlcfile)
mlc
## 'codeml_mlc' S4 object that stored information of
##   '/tmp/RtmpQnlgEk/Rinst195ebb11106/treeio/extdata/PAML_Codeml/mlc'. 
## 
## ...@ tree:
## Phylogenetic tree with 15 tips and 13 internal nodes.
## 
## Tip labels:
##  A, B, C, D, E, F, ...
## Node labels:
##  16, 17, 18, 19, 20, 21, ...
## 
## Unrooted; includes branch lengths.
## 
## with the following features available:
##  't',    'N',    'S',    'dN_vs_dS', 'dN',   'dS',   'N_x_dN',   'S_x_dS'.

Please aware that / and * are not valid characters in names, they were changed to _vs_ and _x_ respectively.

So dN_vs_dS is dN/dS, N_x_dN is N*dN, and S_x_dS is S*dS.

CODEML output: rst and mlc files

We annotate the tree with information presented in rst and mlc file separately as demonstrated in previous sessions.

We can also use both of them which is highly recommended.

ml <- read.codeml(crstfile, mlcfile)
ml
## 'codeml' S4 object that stored information of
##   '/tmp/RtmpQnlgEk/Rinst195ebb11106/treeio/extdata/PAML_Codeml/rst' and 
##  '/tmp/RtmpQnlgEk/Rinst195ebb11106/treeio/extdata/PAML_Codeml/mlc'. 
## 
## ...@ tree:
## Phylogenetic tree with 15 tips and 13 internal nodes.
## 
## Tip labels:
##  A, B, C, D, E, F, ...
## Node labels:
##  16, 17, 18, 19, 20, 21, ...
## 
## Unrooted; includes branch lengths.
## 
## with the following features available:
##  'marginal_subs',    'joint_subs',   'marginal_AA_subs', 'joint_AA_subs',    't',
##  'N',    'S',    'dN_vs_dS', 'dN',   'dS',   'N_x_dN',   'S_x_dS'.

All the features in both files are available for annotation. For example, we can annotate dN/dS with the tree in rstfile and amino acid substitutions with the tree in mlcfile.

Parsing HYPHY output

nwk <- system.file("extdata/HYPHY", "labelledtree.tree", package="treeio")
ancseq <- system.file("extdata/HYPHY", "ancseq.nex", package="treeio")
tipfas <- system.file("extdata", "pa.fas", package="treeio")
hy <- read.hyphy(nwk, ancseq, tipfas)
hy
## 'hyphy' S4 object that stored information of 
##   '/tmp/RtmpQnlgEk/Rinst195ebb11106/treeio/extdata/HYPHY/labelledtree.tree', 
##  '/tmp/RtmpQnlgEk/Rinst195ebb11106/treeio/extdata/HYPHY/ancseq.nex' and 
##  '/tmp/RtmpQnlgEk/Rinst195ebb11106/treeio/extdata/pa.fas'. 
## 
## ...@ tree:
## Phylogenetic tree with 15 tips and 13 internal nodes.
## 
## Tip labels:
##  K, N, D, L, J, G, ...
## Node labels:
##  Node1, Node2, Node3, Node4, Node5, Node12, ...
## 
## Unrooted; includes branch lengths.
## 
## with the following features available:
##  'subs', 'AA_subs'.

Parsing r8s output

r8s output contains 3 trees, namely TREE, RATO and PHYLO for time tree, rate tree and absolute substitution tree respectively.

r8s <- read.r8s(system.file("extdata/r8s", "H3_r8s_output.log", package="treeio"))
r8s
## 'r8s' S4 object that stored information of
##   '/tmp/RtmpQnlgEk/Rinst195ebb11106/treeio/extdata/r8s/H3_r8s_output.log'.
## 
## ...@ trees: 
## 
## with the following features available:
##  'TREE', 'RATO', 'PHYLO'.

Parsing output of RAxML bootstraping analysis

raxml_file <- system.file("extdata/RAxML", "RAxML_bipartitionsBranchLabels.H3", package="treeio")
raxml <- read.raxml(raxml_file)
raxml
## 'treedata' S4 object that stored information of
##   '/tmp/RtmpQnlgEk/Rinst195ebb11106/treeio/extdata/RAxML/RAxML_bipartitionsBranchLabels.H3'.
## 
## ...@ tree: 
## Phylogenetic tree with 64 tips and 62 internal nodes.
## 
## Tip labels:
##  A_Hokkaido_M1_2014_H3N2_2014, A_Czech_Republic_1_2014_H3N2_2014, FJ532080_A_California_09_2008_H3N2_2008, EU199359_A_Pennsylvania_05_2007_H3N2_2007, EU857080_A_Hong_Kong_CUHK69904_2006_H3N2_2006, EU857082_A_Hong_Kong_CUHK7047_2005_H3N2_2005, ...
## 
## Unrooted; includes branch lengths.
## 
## with the following features available:
##  'bootstrap'.

Parsing NHX tree

NHX (New Hampshire eXtended) format is commonly used in phylogenetics software (e.g. PHYLDOG6, RevBayes9, etc.)

nhxfile <- system.file("extdata/NHX", "ADH.nhx", package="treeio")
nhx <- read.nhx(nhxfile)
nhx
## 'treedata' S4 object that stored information of
##   '/tmp/RtmpQnlgEk/Rinst195ebb11106/treeio/extdata/NHX/ADH.nhx'.
## 
## ...@ tree: 
## Phylogenetic tree with 8 tips and 4 internal nodes.
## 
## Tip labels:
##  ADH2, ADH1, ADHY, ADHX, ADH4, ADH3, ...
## 
## Rooted; includes branch lengths.
## 
## with the following features available:
##  'S',    'D',    'B'.
## ggtree(nhx) + geom_tiplab() + geom_point(aes(color=S), size=5, alpha=.5) +
##     theme(legend.position="right") +
##     geom_text(aes(label=branch.length, x=branch), vjust=-.5) +
##     xlim(NA, 0.3)

Parsing Phylip tree

Phylip format contains taxa sequences with Newick tree text.

phyfile <- system.file("extdata", "sample.phy", package="treeio")
phylip <- read.phylip(phyfile)
phylip
## 'phylip' S4 object that stored information of
##   '/tmp/RtmpQnlgEk/Rinst195ebb11106/treeio/extdata/sample.phy'.
## 
## ...@ tree: 
## Phylogenetic tree with 15 tips and 13 internal nodes.
## 
## Tip labels:
##  K, N, D, L, J, G, ...
## 
## Unrooted; no branch lengths.
## 
## with sequence alignment available (15 sequences of length 2148)

Parsing EPA and pplacer output

EPA3 and PPLACER4 have common output file format, jplace, which can be parsed by read.jplace() function.

jpf <- system.file("extdata/sample.jplace",  package="treeio")
jp <- read.jplace(jpf)
print(jp)
## 'jplace' S4 object that stored information of
##   '/tmp/RtmpQnlgEk/Rinst195ebb11106/treeio/extdata/sample.jplace'. 
## 
## ...@ tree: 
## Phylogenetic tree with 13 tips and 12 internal nodes.
## 
## Tip labels:
##  A, B, C, D, E, F, ...
## 
## Rooted; includes branch lengths.
## 
## with the following features available:
##  'edge_num', 'likelihood',   'like_weight_ratio',    'distal_length',
##  'pendant_length'.

In ggtree, we provide get.placements method to access the placement.

## get only best hit
get.placements(jp, by="best")
##   name edge_num likelihood like_weight_ratio distal_length pendant_length
## 1   AA       24  -61371.30          0.333344         3e-06       0.003887
## 2   BB        1  -61312.21          0.333335         1e-06       0.000003
## 3   CC        8  -61312.23          0.200011         1e-06       0.000003
## get all placement
get.placements(jp, by="all")
##   name edge_num likelihood like_weight_ratio distal_length pendant_length
## 1   AA       24  -61371.30          0.333344      0.000003       0.003887
## 2   BB        1  -61312.21          0.333335      0.000001       0.000003
## 3   BB        2  -61322.21          0.333322      0.000003       0.000003
## 4   BB      550  -61352.21          0.333322      0.000961       0.000003
## 5   CC        8  -61312.23          0.200011      0.000001       0.000003
## 6   CC        9  -61322.23          0.200000      0.000003       0.000003
## 7   CC       10  -61342.23          0.199992      0.000003       0.000003

This is only a tiny sample file. In reality, EPA and PPLACER may place thousands of short reads on a reference tree.

We may, for example, count the number of placement and annotate this information in the tree.

References

1. Yang, Z. PAML 4: Phylogenetic analysis by maximum likelihood. Molecular Biology and Evolution 24, 1586–1591 (2007).

2. Bouckaert, R. et al. BEAST 2: A software platform for bayesian evolutionary analysis. PLoS Comput Biol 10, e1003537 (2014).

3. Berger, S. A., Krompass, D. & Stamatakis, A. Performance, accuracy, and web server for evolutionary placement of short sequence reads under maximum likelihood. Systematic Biology 60, 291–302 (2011).

4. Matsen, F. A., Kodner, R. B. & Armbrust, E. V. Pplacer: Linear time maximum-likelihood and bayesian phylogenetic placement of sequences onto a fixed reference tree. BMC Bioinformatics 11, 538 (2010).

5. Pond, S. L. K., Frost, S. D. W. & Muse, S. V. HyPhy: Hypothesis testing using phylogenies. Bioinformatics 21, 676–679 (2005).

6. Boussau, B. et al. Genome-scale coestimation of species and gene trees. Genome Res. 23, 323–330 (2013).

7. Marazzi, B. et al. Locating evolutionary precursors on a phylogenetic tree. Evolution 66, 3918–3930 (2012).

8. Stamatakis, A. RAxML version 8: A tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics btu033 (2014). doi:10.1093/bioinformatics/btu033

9. Höhna, S. et al. Probabilistic graphical model representation in phylogenetics. Syst Biol 63, 753–771 (2014).

10. Paradis, E., Claude, J. & Strimmer, K. APE: Analyses of phylogenetics and evolution in r language. Bioinformatics 20, 289–290 (2004).

11. Schliep, K. P. Phangorn: Phylogenetic analysis in r. Bioinformatics 27, 592–593 (2011).