A protein’s amino-acid preferences matter in modeling viral divergence

From the Bloom lab, Basic Sciences Division

Viruses undergo evolution and natural selection, just like cellular life. Over successive generations, viral populations accumulate heritable genetic changes, some of which are adaptations to environmental changes or to the host’s immune response. Because viruses have short generation times and large population sizes, they often evolve rapidly. This rapid evolution can be a major threat to public health and can seriously hamper the development of successful vaccines and antiviral drugs.

One systematic way to study the evolution of viruses is molecular phylogenetics, which analyzes heritable molecular differences, predominantly in DNA sequences, to infer a virus’s evolutionary relationships. With this approach, the time since gene sequences diverged can be estimated. This knowledge is important for understanding evolutionary history as well as evolutionary factors such as mutation rates, population sizes and selection pressures. Branch lengths quantify genetic divergence: the longer the branch, the greater the divergence between gene sequences (see figure). For several viruses, however, divergence times have been underestimated, partly because standard phylogenetic models poorly describe how natural selection actually acts on protein-coding genes, which evolve under purifying selection, the selective removal of deleterious alleles. So how can these inadequacies be addressed? Sarah Hilton, a graduate student in Dr. Jesse Bloom’s lab in the Basic Sciences Division, may have shown one way the models can be improved.

Dr. Bloom explained: “A crucial aspect of studying evolution is estimating how much time has elapsed since the divergence of related genes. In this study, Sarah showed how a new class of models that accounts for the specific constraints on different sites in a gene can remedy some problems with the accuracy of these estimates for highly diverged genes.”


Visualization of phylogenetic trees and the effect of site-specific amino-acid preferences on HA branch length estimation. Figure from publication

In work published in a recent issue of the journal Virus Evolution, the authors examined how estimates of deep branch lengths on phylogenetic trees change when models account for the fact that proteins prefer specific amino acids at specific sites. The Bloom lab had previously developed experimentally informed codon models (ExpCMs), which are based on which amino acids are preferred at each protein site and have shown improved phylogenetic fit and measurements of natural selection. The authors compared ExpCMs with more conventional codon substitution models, and with models that infer the amino-acid preferences from the natural sequences, in how they estimate branch lengths on a phylogenetic tree of influenza virus hemagglutinin (HA). They found that models accounting for site-specific amino-acid preferences estimated longer deep branches than conventional models, regardless of whether the preferences were measured experimentally or inferred from the sequence alignment: the branch lengths estimated by ExpCMs were similar to those from a mutation-selection model that infers amino-acid preferences from natural sequence data. Since different protein sites prefer different amino acids, this work highlights the importance of taking this factor into account to more accurately model purifying selection and the divergence of highly diverged genes.
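To see why ignoring site-specific constraints can shorten deep branches, consider a deliberately simplified sketch. It is not the ExpCM or the mutation-selection model from the study. It assumes a toy equal-rates (Jukes-Cantor-style) model in which every site tolerates the same small number of amino acids, and the function names, the choice of 4 tolerated amino acids and the true distance of 2.0 substitutions per site are all hypothetical values chosen for illustration.

```python
import math

def expected_diff(d, k):
    """Expected fraction of differing sites after d substitutions per site,
    under a toy equal-rates model with k interchangeable states per site."""
    limit = 1 - 1 / k                      # saturation level: differences plateau here
    return limit * (1 - math.exp(-d / limit))

def estimate_distance(p, k):
    """Invert expected_diff: the distance implied by an observed difference p
    if the analyst assumes k states are allowed at every site."""
    limit = 1 - 1 / k
    if p >= limit:
        return math.inf                    # fully saturated: distance is unbounded
    return -limit * math.log(1 - p / limit)

# Hypothetical scenario: each site really tolerates only 4 amino acids.
true_d = 2.0
k_true = 4
p_obs = expected_diff(true_d, k_true)

# A model that assumes all 20 amino acids are acceptable everywhere
# reads the same data as a shorter distance.
d_naive = estimate_distance(p_obs, 20)

# A model that knows the per-site constraint recovers the true distance.
d_site_aware = estimate_distance(p_obs, k_true)

print(f"observed fraction of differing sites: {p_obs:.3f}")
print(f"estimate assuming 20 states per site: {d_naive:.2f}")
print(f"estimate assuming {k_true} states per site: {d_site_aware:.2f}")
```

In this toy setting, constrained sites saturate at a lower level of sequence difference than an unconstrained model expects, so the unconstrained model interprets the observed differences as less elapsed evolution. The models in the study are far richer, with codon-level substitution and preferences that differ from site to site, but the direction of the effect is the same one the authors report for deep branches.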

While modeling site-specific amino-acid preferences is a distinct improvement over most models, more work remains in the quest for greater model accuracy. Sarah Hilton elaborated: “Here we looked at the effect models informed by empirical measurements had on branch length estimation. Next, we are developing statistical tests to evaluate the models’ adequacy. These tests will allow us to investigate over what evolutionary time scales these models are adequate descriptors of natural sequence evolution, a question raised during the study, in a quantitative and comprehensive manner.” This could one day enhance our ability to understand virus evolution from gene sequences. Until then, each improvement to the models brings that goal one step closer.

Hilton SK, Bloom JD. 2018. Modeling site-specific amino-acid preferences deepens phylogenetic estimates of viral sequence divergence. Virus Evol, 4(2), vey033.

Funding was provided by the National Institutes of Health and the Howard Hughes Medical Institute.