scoup: Simulate Codons with Darwinian Selection Incorporated as an Ornstein-Uhlenbeck Process

Hassan Sadiq1

1Department of Statistics and Actuarial Science, Stellenbosch University, South Africa

29 October 2024

Abstract

Genetic simulations are an important part of molecular biology. They are useful for assessing the efficiency and the sensitivity of models of evolution. Despite their relevance, hardly any simulator dedicated to sequence generation for natural selection inference analyses exist on the Bioconductor platform. In the broader molecular evolution genre, existing genetic simulators are yet to fully exploit the correspondence between the population genomic and the phylogenetic literature. scoup was designed as a contribution toward filling these voids. With scoup, it is possible to explore the implications of the interplay between mutation and genetic drift on the phenomenological inferences of natural selection obtained from phylogenetic models. The ratio of the variance of non-synonymous to synonymous selection coefficient (vN/vS) is also presented as a reliable selection discriminant metric. Example code of how to use the package are presented with elaborate comments.

Package

scoup 1.1.0

1 Introduction

Statistical inference of the extent at which Darwinian natural selection had impacted on genetic data from multiple populations commands a healthy quota of the phylogenetic literature (Jacques et al. 2023). Validation of these codon-based models relies heavily on simulated data. A search of the entries on the Bioconductor (Gentleman et al. 2004) platform, on 29 July 2024, with keywords codon, mutation, selection, simulate and simulation returned a total of 72 unique (out of the 2300 available Software) packages. None of the retrieved entries was dedicated to codon data simulation for natural selection analyses. Given the ever increasing diverse types of models of natural selection inference from molecular data that exist, there is indeed need for applicable packages on the platform.

Population genomic studies provided the mathematical foundation upon which phylogenetics thrived (Wright 1931; Fisher 1922; Hardy 1908; Weinberg 1908; Darwin 1859). The thirst to bridge the gap between these two genres of evolutionary biology continue to drive the invention of more complex models of evolution (Aris-Brosou and Rodrigue 2012). Consequently, there is need to develop codon sequence simulators to match the growth. scoup is designed on the basis of the mutation-selection (MutSel) framework (Halpern and Bruno 1998) as a contribution to this quest. Only a couple of existing selection-focused simulators in the literature used the MutSel framework (Spielman and Wilke 2015; Arenas and Posada 2014). This is most probably due to the perceived complexity of the methodology. In scoup, the versatility of the Uhlenbeck and Ornstein (1930) algorithm as a framework for evolutionary analyses (Bartoszek et al. 2017) was exploited to circumvent the complexity.

In a bid to identify an appropriate quantifier that permits direct comparison between the degree of selection signatures imposed during simulation and that inferred, the ratio of the variance of the non-synonymous to synonymous selection coefficients (vN/vS) was discovered to be appropriate. The vN/vS statistic is consequently posited as a quality selection discriminant metric. scoup therefore represents an important contribution to the phylogenetic modelling literature. Example code of how to successfully use the package is presented below.

2 Installation

Use the following code to install scoup from the Bioconductor platform.

if(!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("scoup")

3 Sample Code

Three primary evolutionary algorithms are available in scoup. These include the frequency-dependent (Jones et al. 2017; Ayala and Campbell 1974), the Ornstein-Uhlenbeck (OU) and the discrete algorithms. Example of R (R Core Team 2024) code where these functions were utilised are presented. The homogeneous (Muse and Gaut 1994; Goldman and Yang 1994) and heterogeneous (site-wise and branch-wise) (Nielsen and Yang 1998; Kosakovsky Pond and Frost 2005) selection inference modelling contexts were explored. Data quality was assessed by comparing the maximum likelihood inferred (\(\omega\)) and the analytically calculated (\(\mathrm{d}N/\mathrm{d}S\)) estimates to the magnitude of the imposed selection pressure (measured by vN/vS). Template code used to analyse the output data (to obtain \(\omega\)) with PAML (Álvarez-Carretero, Kapli, and Yang 2023; Yang 2007) and FUBAR (Murrell et al. 2013; Kosakovsky Pond, Frost, and Muse 2005) are presented in the Appendix. The R code presented subsequently, require that the user should have already installed the scoup package.

3.1 Ornstein-Uhlenbeck Sensitivity

# Make package accessible in R session
library(scoup)

## Loading required package: Matrix

# Number of extant taxa
## Excluded values contributed to results presented in article
leaves <- 8 # 64

# Number of codon sites
## Excluded values contributed to results presented in article
sSize <- 15 # 250

# Number of data replications for each parameter combination
## Edited count was used for the results presented in article
sims <- 1 # 50

# OU reversion parameter (Theta) value
## Excluded values contributed to results presented in article
eThta <- c(0.01) # c(0.01, 0.1, 1)

# OU asymptotic variance value
## Excluded values contributed to results presented in article 
eVary <- c(0.0001) # c(0.0001, 0.01, 1)

# OU landscape shift parameters
hbrunoStat <- hbInput(c(vNvS=1, nsynVar=0.01))

# Sequence alignment size information
seqStat <- seqDetails(c(nsite=sSize, ntaxa=leaves))

# Iterate over all listed OU variance values
for(g in seq(1,length(eVary))){

    # Iterate over all listed OU reversion parameter values
    for(h in seq(1,length(eThta))){

        # Create appropriate simulation function ("ou") object
        adaptStat <- ouInput(c(eVar=eVary[g],Theta=eThta[h]))

        # Iterate over the specified number of replicates
        for(i in seq(1,sims)){

            # Execute simulation
            simData <- alignsim(adaptStat, seqStat, hbrunoStat, NULL)
        }
    }
}
# Print simulated alignment
seqCOL(simData)

## DNAStringSet object of length 8:
##     width seq                                               names               
## [1]    45 ACGACAGATGGATTCACAGCGAGGCTGGTACTCCGAACTCTAACT     S001
## [2]    45 ACGACAGATGGATTCACAGCAGGGCTGGTACTCCGAACTCTAACA     S002
## [3]    45 ACGTCAGATGGATTCACAGCGGGGGTGGTATTCCGAACTCTATCA     S003
## [4]    45 ACGTCAGATGGATTCACAGCGGGGCTGGTATTCCGAACTCTATCA     S004
## [5]    45 ACGCCAGATGGATTCACAGCGGGGCTGGTATTCCCAACTCTATCA     S005
## [6]    45 ACGCCAGATGGATTCACAGCGGGGCTGGTATTCCCAACTCTATCA     S006
## [7]    45 ACGCCAGATGGATTCACAGCGGGGCTGGTATTCATAACTCTATCA     S007
## [8]    45 ACGCCAGATGGATTCACAGCGGGGCTGGTATTCATAACTCTATCA     S008

As expected, the correlation coefficient estimate was approximately \(0.9974\) when the means of the inferred (\(\omega\)) and the calculated (\(\mathrm{d}N/\mathrm{d}S\)) selection effects were compared. The correlation estimation included all the commented values.

3.2 vN/vS Sensitivity

# Make package accessible in R session
library(scoup)

# Number of extant taxa
## Omitted value was used for the results presented in article
xtant <- 8 # 64

# Number of codon sites
## Omitted count was used for the results presented in article
siteSize <- 15 # 64

# Number of data replications for each parameter combination
## Omitted count was used for the results presented in article
simSize <- 1 # 50

# Variance of the non-synonymous selection coefficients
## Excluded values contributed to results presented in article
nsynVary <- c(0) # c(0, 0.001, 0.1)

# Ratio of the variance of the non-synonymous to synonymous coeff.
## Excluded values contributed to results presented in article
vNvSvec <- c(0) # c(0, 0.001,  1, 10)

# Sequence alignment size information
seqStat <- seqDetails(c(nsite=siteSize, ntaxa=xtant))

# Iterate over all listed coefficient variance ratios
for(a in seq(1,length(vNvSvec))){

    # Iterate over all listed non-synonymous coefficients variance
    for(b in seq(1,length(nsynVary))){

        # Create appropriate simulation function ("omega") object
        adaptData <- wInput(list(vNvS=vNvSvec[a],nsynVar=nsynVary[b]))
        
        # Iterate over the specified number of replicates
        for(i in seq(1,simSize)){

            # Execute simulation
            simulateSeq <- alignsim(adaptData, seqStat, NA)
        }
    }
}
# Print simulated alignment
cseq(simulateSeq)

##                                            Sequence
##  S001 GGCACGGGGGCGAGACTTCTCCCCTCGACCTGTATATCCGCCCGA
##  S002 GGCCCTCGAGTGAGACTTTCACCCCCAATCTGGATGTCCGTCCAG
##  S003 GGCCCGCACGTAAGGTTTTTAGCCTCAAACTATATCTCCCTCCAA
##  S004 GGCCCGCATGTAAGTCCTTCTCCCTCAGACTATACATCCACCCGG
##  S005 CGCCCGCAAGTGAGTACTTCCCCCTTGGACCATTTATCCTTTCGA
##  S006 CGTCCGCAAGTGGGTGTTTTTCCCCTGAACCGCCTGTCCTTGCGT
##  S007 GGTCAACAGGTAGGTCTTTTCTCGCTGAATCACCGGTCCCTCTGG
##  S008 GGTCAACAAATAAGTTTTTTCTCCCCGGATCACAGGCCCTACCGG

Sequences generated with the presented code (with the excluded values activated) produced strongly correlated selection inferences (correlation coefficient \(\approx 0.9923\)) when the average \(\mathrm{d}N/\mathrm{d}S\) and the \(\omega\) values were compared. This implementation is an example of how to execute the frequency-dependent evolutionary technique with the package.

3.3 Site-wise Application

# Make package accessible in R session
library(scoup)

# Number of codon sites
## Commented value was used for results presented in article
sitesize<- 15 # 100

# Variance of non-synonymous selection coefficients
nsynVary <- 0.01

# Number of extant taxa
## Commented value was used for results presented in article
taxasize <- 8 # 1024

# Sequence alignment size information
seqsEntry <- seqDetails(c(nsite=sitesize, ntaxa=taxasize))

# Create the applicable ("ou") object for simulation function
## eVar= OU asymptotic variance, Theta=OU reversion parameter
adaptEntry <- ouInput(c(eVar=0.1,Theta=1))

# Ratio of the variance of the non-synonymous to synonymous coeff.
## Excluded values contributed to results presented in article
sratio <- c(0) # c(0, 1e-06, 1e-03, 0.1, 1, 10, 1000)

# Iterate over all listed coefficient variance ratios
for(a0 in seq(1,length(sratio))){

    # OU landscape shift parameters
    mValues <- hbInput(c(vNvS=sratio[a0], nsynVar=nsynVary))
    
    # Execute simulation
    simSeq <- alignsim(adaptEntry, seqsEntry, mValues, NA)
}
# Print simulated codon sequence
cseq(simSeq)

##                                            Sequence
##  S001 CTCATACTGACGATGTCCAATCGCGCGGCGGACGGCCTTCACCCA
##  S002 CTCATACTGACGATGTCCAATCGCGCGGCGGACGGCCCTCACCCA
##  S003 CTCATGCTGACGATGTCTAACCGTGCGGCGGACGGCCTTCACCCG
##  S004 CTCATGCTGACCATGTCCAACCGGGCGGCGGGCGGGCTTCACCCA
##  S005 CTAATGCTGACAATGCCAAATCGTGCCGCGGGCGGCCTTCACCCG
##  S006 CCCATGCTAACGATGCCAAATCGTGCAGCGGGCGGCCGTCACCCG
##  S007 CTCATGCTGACGATGTCAAATCGGGCAGCGGACGGACTCCATCTA
##  S008 CTCATGTTGACGATGTCTAATCGGGCAGCGGACGGCCTCCATCCA

This is another example of how to call the OU shifting landscape evolutionary approach. The results obtained yielded a pairwise correlation coefficient estimate of approximately \(0.9988\) between the means of \(\mathrm{d}N/\mathrm{d}S\) and \(\omega\). The correlation coefficient estimates were approximately \(0.8123\) and \(0.8305\) when the averages were each compared to vN/vS, respectively.

3.4 Branch-wise (Episodic) Test

# Make package accessible in R session
library(scoup)

# Number of internal nodes on the desired balanced tree
iNode <- 3

# Number of required codon sites
## Excluded value was used for the results presented in article
siteCount <- 15 # 1000

# Variance of non-synonymous selection coefficients
nsnV <- 0.01

# Number of data replications for each parameter combination
## Edited count was used for the results presented in article
nsim <- 1 # 50

# Ratio of the variance of the non-synonymous to synonymous coeff.
## Excluded values contributed to results presented in article
vNvSvec <- c(0) # c(0, 1e-06, 1e-03, 0.1, 1, 10, 100)

# Sequence alignment size information
seqsBwise <- seqDetails(c(nsite=siteCount, blength=0.10))

# Iterate over all listed coefficient variance ratios
for(h in seq(1,length(vNvSvec))){

    # Iterate over the specified number of replicates
    for(i in seq(1,nsim)){

        # Create the parameter set applicable at each internal tree node
        scInput <- rbind(vNvS=c(rep(0,iNode-1),vNvSvec[h]),
                        nsynVar=rep(nsnV,iNode))
        
        # Create the applicable ("discrete") object for simulation function
        adaptBranch <- discreteInput(list(p02xnodes=scInput))
        
        # Execute simulation
        genSeq <- alignsim(adaptBranch, seqsBwise, NULL)
    }
}
# Print simulated sequence data
seqCOL(genSeq)

## DNAStringSet object of length 8:
##     width seq                                               names               
## [1]    45 TATTCGACTCAAGCTACTACGTCGCTGATGCTGCAACTGGATAGA     S001
## [2]    45 TATTCAACTCAAGCTACTACGTCTCTTATGCTGCAACTGGATAGA     S002
## [3]    45 TATTCAACTCAAGCGATTACGTCTTTAATGCTACAACTAGATAGA     S003
## [4]    45 TATTCAACTCAAGCGATTACATCTTTGATGCTACAACTAGATAGA     S004
## [5]    45 TATCCGACACAAGCTACCACATCTTTAATGCTGCAACTAGATAGA     S005
## [6]    45 TATCCAACTCAAGCTACTACATCCTTAATGCTACAGTTAGATAGA     S006
## [7]    45 TATTCGACTCAATCTACTACGTCCTTAATTCTGCAGCTTGATAGA     S007
## [8]    45 TACTCGACTCAAGCTACTACGTCCTTGATGCTGCAGCTTGATCGA     S008

The correlation coefficient between the averages of the analytical \(\mathrm{d}N/\mathrm{d}S\) and the inferred \(\omega\) estimates was approximately \(0.9998\), obtained from 50 independent iterations of the code for all the listed vN/vS values. The correlation coefficient estimate was approximately \(0.6349\) for vN/vS vs \(\omega\) and \(0.6360\) for vN/vS vs \(\mathrm{d}N/\mathrm{d}S\).

4 Conclusion

Reference scoup code were presented to facilitate use of the package. Although not explicitly presented, it is also possible to generate data with signatures of directional selection by setting the aaPlus element of the wInput entry of the alignsim function accordingly. The capacity of the package is expected to be extended in future versions.

5 Citation

A more appropriate citation will be provided for the package after it has been accepted to the Bioconductor platform and after the corresponding article has been accepted for publication. In the meantime, to cite this package, use Sadiq, H. 2024. “scoup: Simulate Codon Sequences with Darwinian Selection Incorporated as an Ornstein-Uhlenbeck Process”. R Package. doi:10.18129/B9.bioc.scoup.

6 References

Álvarez-Carretero, Sandra, Paschalia Kapli, and Ziheng Yang. 2023. “Beginner’s Guide on the Use of PAML to Detect Positive Selection.” Molecular Biology and Evolution 40 (4): msad041.

Arenas, Miguel, and David Posada. 2014. “Simulation of Genome-wide Evolution under Heterogeneous Substitution Models and Complex Multispecies Coalescent Histories.” Molecular Biology and Evolution 31 (5): 1295–1301.

Aris-Brosou, Stéphane, and Nicolas Rodrigue. 2012. “The Essentials of Computational Molecular Evolution.” In Evolutionary Genomics: Statistical and Computational Methods, Volume 1, edited by Maria Anisimova, 855:111–52. Methods in Molecular Biology, Springer Science+Business Media, LLC.

Ayala, Francisco J, and Cathryn A Campbell. 1974. “Frequency-Dependent Selection.” Annual Review of Ecology and Systematics 5: 115–38.

Bartoszek, Krzysztof, Sylvain Glémin, Ingemar Kaj, and Martin Lascoux. 2017. “Using the Ornstein–Uhlenbeck Process to Model the Evolution of Interacting Populations.” Journal of Theoretical Biology 429: 35–45.

Darwin, Charles. 1859. On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life. John Murray, London, UK.

Fisher, R A. 1922. “On the Dominance Ratio.” Proceedings of the Royal Society of Edinburgh 42: 321–41.

Gentleman, Robert C, Vincent J Carey, Douglas M Bates, Ben Bolstad, Marcel Dettling, Sandrine Dudoit, Byron Ellis, et al. 2004. “Bioconductor: Open Software Development for Computational Biology and Bioinformatics.” Genome Biology 5 (10): R80.

Goldman, Nick, and Ziheng Yang. 1994. “A Codon-based Model of Nucleotide Substitution for Protein-coding DNA Sequences.” Molecular Biology and Evolution 11 (5): 725–36.

Halpern, Aaron L, and William J Bruno. 1998. “Evolutionary Distances for Protein-Coding Sequences: Modelling Site-Specific Residue Frequencies.” Molecular Biology and Evolution 15 (7): 910–17.

Hardy, Godfrey Harold. 1908. “Mendelian Proportions in a Mixed Population.” Science 28 (706): 49–50.

Jacques, Florian, Paulina Bolivar, Kristian Pietras, and Emma U Hammarlund. 2023. “Roadmap to the Study of Gene and Protein Phylogeny and Evolution – A Practical Guide.” PLoS ONE 18 (2): e0279597.

Jones, Christopher T, Noor Youssef, Edward Susko, and Joseph P Bielawski. 2017. “Shifting Balance on a Static Mutation-Selection Landscape: A Novel Scenario of Positive Selection.” Molecular Biology and Evolution 34 (2): 391–407.

Kosakovsky Pond, Sergei L, and Simon D W Frost. 2005. “A Genetic Algorithm Approach to Detecting Lineage-Specific Variation in Selection Pressure.” Molecular Biology and Evolution 22 (4): 478–85.

Kosakovsky Pond, Sergei L, Simon D W Frost, and S V Muse. 2005. “HyPhy: Hypothesis Testing Using Phylogenies.” Bioinformatics 29: 676–79.

Murrell, Ben, Sasha Moola, Amandla Mabona, Thomas Weighill, Daniel Sheward, Sergei L Kosakovsky Pond, and Konrad Scheffler. 2013. “FUBAR: A Fast, Unconstrained Bayesian AppRoximation for Inferring Selection.” Molecular Biology and Evolution 30 (5): 1196–1205.

Muse, Spencer V, and Brandon S Gaut. 1994. “A Likelihood Approach for Comparing Synonymous and Nonsynonymous Nucleotide Substitution Rates, with Application to the Chloroplast Genome.” Molecular Biology and Evolution 11 (5): 715–24.

Nielsen, Rasmus, and Ziheng Yang. 1998. “Likelihood Models for Detecting Positively Selected Amino Acid Sites and Applications to the HIV-1 Envelope Gene.” Genetics 148 (3): 929–36.

R Core Team. 2024. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Spielman, Stephanie J, and Claus O Wilke. 2015. “Pyvolve: A Flexible Python Module for Simulating Sequences along Phylogenies.” PLoS ONE 10 (9): e0139047.

Uhlenbeck, G E, and L S Ornstein. 1930. “On the Theory of the Brownian Motion.” Physical Review 36: 823–41.

Weinberg, Wilhelm. 1908. “Über den Nachweis der Vererbung beim Menschen.” Jahreshefte Des Vereins Für Vaterländische Naturkunde in Württemberg 64: 369–82.

Wright, Sewall. 1931. “Evolution in Mendelian Populations.” Genetics 16: 97–159.

Yang, Ziheng. 2007. “PAML 4: phylogenetic analysis by maximum likelihood.” Molecular Biology and Evolution 24 (8): 1586–91.

Appendix

A Appendix: Sample Data Analyses (Non-R) Code

CODEML script executed in PAML to infer single alignment-wide \(\omega\) estimates for data sets generated from 50 independent executions of each of the sensitivity analyses code presented above. The same CODEML script was used to analyse data (also 50 replicates) from the episodic analyses code, with the model entry replaced with 2. The scoup simulated sequence data and tree are seq.nex and seq.tre, respectively.

    seqfile   = seq.nex
    treefile  = seq.tre
    outfile   = seq.out
    noisy     = 0
    verbose   = 0
    seqtype   = 1
    ndata     = 1
    icode     = 0
    cleandata = 0
    model     = 0
    NSsites   = 0
    CodonFreq = 3
    estFreq   = 0
    clock     = 0
    fix_omega = 0
    omega     = 1e-05

FUBAR command executed with HyPhy through the terminal in MacBook. Note that HyPhy was already installed on the computer. The seq.nex input is the scoup simulated codon sequence data that is saved in the same NEXUS file with the tree data. The NEXUS file resides in the working directory.

hyphy fubar --code Universal --alignment seq.nex --tree seq.nex

B Session info

The output of sessionInfo() from the computer where this file was generated is provided below.

## R Under development (unstable) (2024-10-21 r87258)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] scoup_1.1.0      Matrix_1.7-1     BiocStyle_2.35.0
## 
## loaded via a namespace (and not attached):
##  [1] crayon_1.5.3            httr_1.4.7              cli_3.6.3              
##  [4] knitr_1.48              rlang_1.1.4             xfun_0.48              
##  [7] UCSC.utils_1.3.0        jsonlite_1.8.9          S4Vectors_0.45.0       
## [10] Biostrings_2.75.0       htmltools_0.5.8.1       stats4_4.5.0           
## [13] sass_0.4.9              rmarkdown_2.28          grid_4.5.0             
## [16] evaluate_1.0.1          jquerylib_0.1.4         fastmap_1.2.0          
## [19] GenomeInfoDb_1.43.0     IRanges_2.41.0          yaml_2.3.10            
## [22] lifecycle_1.0.4         bookdown_0.41           BiocManager_1.30.25    
## [25] compiler_4.5.0          XVector_0.47.0          lattice_0.22-6         
## [28] digest_0.6.37           R6_2.5.1                GenomeInfoDbData_1.2.13
## [31] bslib_0.8.0             tools_4.5.0             zlibbioc_1.53.0        
## [34] BiocGenerics_0.53.0     cachem_1.1.0