From the earliest days of the pandemic that shocked the world in 2020, researchers at Fred Hutch Cancer Center tracked the rapid evolution and spread of the virus that causes COVID-19 by studying how its genomic sequence — the order of its genetic building blocks — changed over time.
Computational biologist Trevor Bedford, PhD, and evolutionary biologist Jesse Bloom, PhD, became go-to media sources, providing expert information that influenced consequential policy decisions about what to shut down and when.
Researchers have now amassed an enormous database of genomic sequences of SARS-CoV-2 and its many variants, sampled from more than 16 million patients around the world over the last five years.
But the sheer volume of genomic sequences — a dataset that is orders of magnitude bigger than what’s available for any other pathogen — has overwhelmed the capacity of common analytical methods to make sense of it in a timely and practical manner.
Two recently published papers from postdoctoral researchers in the Bedford and Bloom labs showcase new tools invented at Fred Hutch that give researchers the traction they need to make that mountain of data more manageable.
The Bedford study, recently published in the journal Nature, uses a new statistical technique to track the spread of a virus through a population.
The analysis of more than 100,000 genomic sequences of SARS-CoV-2 collected in Washington state between March 2021 and December 2022 captures fine-grained details about transmission between regions and age groups, including important information about the role young children played in transmission.
Meanwhile, the Bloom study, recently published in the journal Cell, shows that the same method developed in his lab to analyze the effects of thousands of SARS-CoV-2 mutations in a single, safe experiment can also be applied to understand the evolutionary capacity of viruses for which only a small number of sequences are available.
This study uses the Bloom method to safely measure the effects of mutations to a key protein from Nipah virus, a rarer but scarier and deadlier pathogen that some researchers worry could pose a risk of triggering a new outbreak or pandemic.
A new way to see the forest when the trees grow too thick
Bedford’s early career focused on using genomic sequences to construct family trees of viruses and visualize their evolution as easily as someone traces the branches of their ancestry from parents to grandparents to great-grandparents.
He built phylogenetic trees to better understand influenza, Ebola, MERS and other viruses and co-founded an open-source website in 2015 called Nextstrain that posts the phylogenetic trees of global pathogens.
In February 2020, Bedford used phylogenetic trees to analyze genomic sequences of a newly emerged virus, SARS-CoV-2, that already was overwhelming hospitals in Wuhan, China.
His analysis revealed a probable transmission chain in Washington state that began in mid-January with the first diagnosed U.S. patient, a man who returned from Wuhan. The genomic sequence of a second Seattle-area patient in late February had several new mutations indicating that the virus had been spreading quickly and largely undetected in Washington.
Bedford sounded the alarm, prompting a rapid shutdown of the region that likely saved thousands of lives in the state.
Nextstrain became an essential tool for visualizing and tracing the evolution of SARS-CoV-2, but as the number of SARS-CoV-2 genomic sequences has swelled, phylogenetic trees have become increasingly unwieldy.
When those family trees grow beyond a few hundred or a few thousand sequences, it gets harder computationally to infer what’s going on. There are just too many branches, twigs and offshoots to untangle.
“It just becomes computationally very costly to reconstruct that tree,” said Cécile Tran-Kiem, PhD, lead author of the Nature study and a postdoctoral researcher in the Bedford Lab in the Vaccine and Infectious Disease Division at Fred Hutch.
She figured out a different way to track the spread of the virus that doesn’t require building trees showing new branches that represent new mutations.
Her method instead looks for pairs of identical genomic sequences to figure out how different subgroups of the population, such as those defined by geography or age, contribute to transmission.
The method takes advantage of a mismatch between the rate of transmission and the rate of mutation in SARS-CoV-2.
“We expect the virus to mutate every 11 to 12 days, but transmission maybe happens every six days, more or less,” Tran-Kiem said. “This means if we’re looking at viruses that haven’t mutated yet, we’re looking at people who are pretty close in a transmission chain.”
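A rough back-of-the-envelope calculation shows why identical sequences point to people who sit close together in a transmission chain. The sketch below assumes, purely for illustration, that mutations arrive as a Poisson process averaging one every 11.5 days and that each transmission step takes six days. The Poisson assumption, the constants' names and the function `prob_still_identical` are not from the study.

```python
import math

# Illustrative values taken from the quote above (assumed midpoints).
MEAN_DAYS_PER_MUTATION = 11.5   # "every 11 to 12 days"
DAYS_PER_TRANSMISSION = 6.0     # "every six days, more or less"


def prob_still_identical(n_steps: int) -> float:
    """Probability that no mutation occurs over n_steps transmission steps.

    Assumes (for illustration only) that mutations follow a Poisson
    process, so P(zero mutations) = exp(-expected number of mutations).
    """
    elapsed_days = n_steps * DAYS_PER_TRANSMISSION
    expected_mutations = elapsed_days / MEAN_DAYS_PER_MUTATION
    return math.exp(-expected_mutations)


for steps in (1, 2, 5, 10):
    print(f"{steps:>2} transmission steps apart: "
          f"P(identical) = {prob_still_identical(steps):.3f}")
```

Under these assumptions, the chance that two viruses are still identical falls off quickly with every additional transmission step between them, which is the intuition behind treating identical pairs as a marker of close links.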
The more often pairs of identical genomic sequences straddle two groups — with one half of the pair in one group and the other half in the other group — the greater the probability of higher transmission between those groups.
The method finds statistical patterns and connections that might otherwise be missed because the virus is moving too fast in the population for the tree method based on new mutations to keep pace.
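The pair-counting idea can be sketched in a few lines. This is a minimal illustration, not the study's actual statistic: the toy sequences, the group labels such as `county_A`, and the functions `count_identical_pairs` and `identical_pair_rate` are all hypothetical. It counts identical-sequence pairs for each pair of groups, then divides by the number of possible pairs so that large groups do not dominate simply because they have more samples.

```python
from collections import Counter
from itertools import combinations


def count_identical_pairs(samples):
    """Count identical-sequence pairs by the unordered pair of groups involved.

    `samples` is a list of (sequence, group) tuples.
    """
    groups_by_sequence = {}
    for sequence, group in samples:
        groups_by_sequence.setdefault(sequence, []).append(group)

    pair_counts = Counter()
    for groups in groups_by_sequence.values():
        # Every pair of samples sharing this exact sequence is one identical pair.
        for g1, g2 in combinations(groups, 2):
            pair_counts[tuple(sorted((g1, g2)))] += 1
    return pair_counts


def identical_pair_rate(samples):
    """Fraction of all possible sample pairs between two groups that are identical."""
    group_sizes = Counter(group for _, group in samples)
    rates = {}
    for (g1, g2), n_identical in count_identical_pairs(samples).items():
        if g1 == g2:
            n_possible = group_sizes[g1] * (group_sizes[g1] - 1) // 2
        else:
            n_possible = group_sizes[g1] * group_sizes[g2]
        rates[(g1, g2)] = n_identical / n_possible
    return rates


# Toy data (hypothetical): short strings stand in for full genomes.
toy_samples = [
    ("AAAC", "county_A"), ("AAAC", "county_B"), ("AAAC", "county_A"),
    ("AAGC", "county_B"), ("AAGC", "county_A"),
    ("ATGC", "county_C"), ("ATGC", "county_C"),
    ("ATGA", "county_C"),
]

toy_counts = count_identical_pairs(toy_samples)
toy_rates = identical_pair_rate(toy_samples)
for pair, rate in sorted(toy_rates.items()):
    print(pair, toy_counts[pair], round(rate, 3))
```

In this toy example, `county_A` and `county_B` share several identical pairs while `county_C` shares none with either, which would suggest more transmission between the first two groups. Because the approach only groups exact matches, it avoids reconstructing a tree altogether.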
Testing the method against real-world data from Washington state
The idea made computational sense, but Tran-Kiem wanted to test it against real-world data to see if her method reached the same conclusions about the spread of the virus as conventional methods that epidemiologists use to track the spread of disease.
“Classically, it’s been shown that patterns of transmission of respiratory pathogens correlate well with mobility data — the virus tends to go where you move,” she said.
Tran-Kiem applied the method to 114,298 SARS-CoV-2 genomes collected in Washington state. The patterns of paired sequences were consistent with expectations from mobility data collected from smartphone records.
She was also able to track transmission between age groups, and those patterns were likewise consistent with expectations based on social contact surveys.
But her approach could potentially overcome the limitations of tracking smartphones (not everyone has one) or conducting surveys (memories of previous contacts may be inaccurate or incomplete). It also saves money by reducing the need for phone and survey data, which is costly to collect.
As predicted, her method showed that adjacent counties were more likely to be linked by identical pairs of sequences than counties that are farther apart, except for adjacent counties separated by the Cascade Range. Also as expected, transmission generally flowed from Western to Eastern Washington.
Some data didn’t match expectations, however.
For example, Tran-Kiem and her colleagues found two counties that shared many more pairs of identical sequences than they should have based on their geography.
They realized that what they were seeing in the genomic data was transmission between men’s prisons. They discussed their results with epidemiologists and physicians working with the Washington State Department of Corrections.
Those experts said that certain policies and procedures regarding the transfer of prisoners and staff between prisons could explain patterns that did not show up in conventional mobility data. The state has only two women’s prisons, which are in adjacent counties and didn’t stand out.
By analyzing the timing of sequence collection, she also could use the method to better understand transmission between age groups and provide some context to the highly debated role played by young children during the pandemic.
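One simple way to picture how collection timing can hint at direction is to ask, for each identical pair that spans two age groups, whose sample was collected first. The sketch below is an assumption about how such a tally could work, not a description of the study's analysis. The toy dates, the labels `adult` and `child`, and the function `earlier_group_counts` are hypothetical, and collection date is only a rough proxy for who was infected first.

```python
from datetime import date


def earlier_group_counts(pairs, group_a, group_b):
    """Tally which group's sample was collected first in each spanning pair.

    `pairs` is a list of ((group, collection_date), (group, collection_date)).
    Pairs that do not span group_a and group_b are ignored.
    """
    counts = {group_a: 0, group_b: 0, "same_day": 0}
    for (g1, d1), (g2, d2) in pairs:
        if {g1, g2} != {group_a, group_b}:
            continue  # pair does not span the two groups of interest
        if d1 == d2:
            counts["same_day"] += 1
        elif d1 < d2:
            counts[g1] += 1
        else:
            counts[g2] += 1
    return counts


# Toy identical-sequence pairs (hypothetical dates and labels).
toy_pairs = [
    (("adult", date(2021, 9, 1)), ("child", date(2021, 9, 4))),
    (("child", date(2021, 9, 10)), ("adult", date(2021, 9, 8))),
    (("adult", date(2021, 9, 12)), ("child", date(2021, 9, 12))),
    (("child", date(2021, 9, 15)), ("adult", date(2021, 9, 18))),
    (("adult", date(2021, 9, 20)), ("adult", date(2021, 9, 21))),
]

toy_timing = earlier_group_counts(toy_pairs, "adult", "child")
print(toy_timing)
```

If one group's samples consistently come first across many pairs, that group is more plausibly the source. Testing delays and sampling differences between age groups would need to be accounted for before drawing any real conclusion.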
Young kids are notorious germ factories who spread colds to classmates and bring them home from school, so initially it was not unreasonable to think they might also play a big role in transmitting COVID-19.
But while it is possible for young children to transmit the virus to adults, that’s not usually how the virus spread in Washington state.
During the alpha and delta variant waves, she found that children ages 0 to 9 could have been a source of infection for the elderly, but not for younger adults.
But that pattern disappeared during the omicron wave. Overall, she found no indication that young school-age children were a major source of transmission in Washington state, even after schools reopened.
“We know for sure they did infect adults, we’re just saying that overall, when there was transmission between children and adults, it didn’t tend to be the kids who infected the adults. It rather tended to be in the other direction,” Tran-Kiem said.
Information like that — which is more precise than data about where someone’s phone has been or who they remember being around before they got sick — could be useful in the future when policymakers must make decisions about whether to close schools and for how long.