Most infectious disease researchers are familiar with the story of how Dr. John Snow tracked a devastating cholera outbreak in London in 1854. By marking the houses impacted by cholera on a map, he found that most cases clustered geographically around a communal water pump on Broad Street. Cases from outside of this block were ultimately traced back to this same water pump, which had indeed become contaminated with Vibrio cholerae. Shutting down this pump helped end the outbreak and ultimately paved the way for the modern sanitation and public health measures we know today.
I like this story because it has a simple moral: when tracking infectious diseases, we often need to focus on what is constant in order to find patterns that could otherwise be lost in a mess of variables.
This is the same principle that Dr. Cécile Tran-Kiem, a post-doc in the Bedford Lab, is using to trace SARS-CoV-2 spread in Washington state. In a new publication in Nature, she and co-authors use geographic proximity of identical viral genomic sequences to understand viral transmission patterns.
One of the Bedford Lab’s go-to tools is drawing phylogenetic trees to map viral genome diversity and mutations over time. However, phylogenetic trees are impractical for large-scale pathogen tracking because generating trees and inferring geographic patterns require significant computational power, which limits how many viral sequences can be included in the analysis. Furthermore, the conclusions can be thrown off by uneven sampling in different locations.
So, instead of focusing on how and where the viral genome changes, the authors tracked spread by following identical SARS-CoV-2 sequences across Washington state. The principle is: a newly infected person is likely to carry a pathogen genetically identical to that of the person who spread the bug, who is probably geographically nearby. After all, SARS-CoV-2 does not mutate every time it transmits; in fact, mutations during acute infection are relatively rare. “It’s really about 1 mutation every 2 weeks along a transmission chain,” says Dr. Bedford.
Grouping viral sequences into clusters of identical sequences allowed the authors to capture epidemiologically linked infections. More detail on this clustering method and how it can be used to understand viral variant tracking can be found here.
To make these clusters, the authors used genomes obtained through comprehensive genomic surveillance in Washington state conducted by the WA Public Health Lab, UW Virology, and the Seattle Flu Study. This genomic surveillance includes de-identified information on age, vaccination status, and geographic residence down to the zip code.
Out of more than 100,000 SARS-CoV-2 genomes analyzed from this project, there were 17,231 clusters made of two or more identical viral genomes. Mapping these clusters with collection date and location information allowed the researchers to understand how the clusters changed in space and time.
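To make the clustering idea concrete, here is a minimal sketch in Python. It is not the authors' pipeline: the function name cluster_identical, the sample IDs, and the toy genomes are all hypothetical, and it assumes that "identical" means an exact string match between full genome sequences. Real surveillance data would also need rules for ambiguous bases and incomplete coverage, which this sketch ignores.

```python
from collections import defaultdict


def cluster_identical(sequences):
    """Group sample IDs whose genome strings match exactly.

    `sequences` maps a sample ID to its genome string (hypothetical input
    format). Only groups with two or more members are returned, mirroring
    the article's "two or more identical viral genomes".
    """
    groups = defaultdict(list)
    for sample_id, genome in sequences.items():
        groups[genome].append(sample_id)
    return [sorted(ids) for ids in groups.values() if len(ids) >= 2]


# Toy data, invented for illustration only.
toy_sequences = {
    "s1": "ACGT",
    "s2": "ACGT",
    "s3": "ACGA",
    "s4": "ACGT",
    "s5": "TTTT",  # no identical partner, so it forms no cluster
    "s6": "ACGA",
}

clusters = cluster_identical(toy_sequences)
```

Each cluster could then be joined with collection dates and zip codes to follow it through space and time, as described above.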
The clusters were then used to develop a metric dubbed relative risk (RR), which can be used to understand transmission between locations and subgroups within those locations. The RR is calculated by first measuring how many genetically identical sequence clusters are shared by the subgroups, then comparing this to how many would be statistically expected based on the sequencing effort between the sampled locations.
“A RR greater than 1 means that we are observing more sequences between the two subgroups we are looking at than expected from where sequences are coming from,” explains Dr. Tran-Kiem. “This could be within the same group (by looking at RR_A,A) or between groups (by looking at RR_A,B).”
Essentially, RR is a measure of how much identical sequences are enriched in groups A and B, “which we use to quantify how frequent transmission is,” says Dr. Tran-Kiem. For more about the RR method, read the explanation published on the Bedford Lab blog here.
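The observed-versus-expected comparison behind RR can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the published implementation: it assumes the quantity being counted is pairs of identical sequences, that each pair is counted in both directions, and that the expected count for groups A and B is the product of each group's share of all pairs. The function names pair_counts and relative_risk, the group labels, and the toy clusters are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_counts(clusters, group_of):
    """Count pairs of identical sequences by the groups of their two members.

    Assumption: each pair is counted in both directions, so the table is
    symmetric and a within-group pair adds 2 to n[(A, A)].
    """
    n = Counter()
    for members in clusters:
        for a, b in combinations(members, 2):
            ga, gb = group_of[a], group_of[b]
            n[(ga, gb)] += 1
            n[(gb, ga)] += 1
    return n


def relative_risk(n, a, b):
    """Observed pairs between a and b divided by the expected number.

    Expected = (pairs involving a) * (pairs involving b) / (all pairs),
    i.e. what sampling effort alone would predict under this assumption.
    """
    total = sum(n.values())
    from_a = sum(v for (g, _), v in n.items() if g == a)
    to_b = sum(v for (_, g), v in n.items() if g == b)
    if from_a == 0 or to_b == 0:
        return float("nan")  # RR is undefined without any pairs
    return n[(a, b)] * total / (from_a * to_b)


# Toy clusters and group labels, invented for illustration only.
toy_clusters = [["s1", "s2", "s3"], ["s4", "s5"], ["s6", "s7"]]
toy_groups = {"s1": "A", "s2": "A", "s3": "B", "s4": "B",
              "s5": "B", "s6": "A", "s7": "A"}

n = pair_counts(toy_clusters, toy_groups)
rr_aa = relative_risk(n, "A", "A")
rr_ab = relative_risk(n, "A", "B")
rr_bb = relative_risk(n, "B", "B")
```

In this toy example the within-group values come out above 1 and the between-group value below 1, which is the pattern one would expect if identical sequences mostly stay within a group.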