Gone but not lost: a computational method to infer missing data measurements

From Tal Einav of the Bloom Lab, Basic Sciences Division

It’s a nightmare scenario for any scientist. You’ve come up with a great idea, and a great way to test it. You’ve worked long and hard, poured money and resources into collecting a dataset sufficient to answer your question. You’re eager to dive into analyzing your data and watch the conclusions materialize. But when the results come back, some of the data is missing or of low quality. Enthusiasm is replaced with anxiety and uncertainty. Do you have enough data to find the answers you seek? Do you have the energy and resources to try again?

Missing data comes in many forms. Experiments fail. Hard drives (and animals, and cell lines) die. Sometimes the data was collected using different methods and cannot be easily merged in analysis. Sometimes there is simply too much potential data to feasibly collect it all. Dr. Tal Einav, postdoctoral fellow with Dr. Jesse Bloom in Fred Hutch’s Basic Sciences Division, is no stranger to such concerns. But rather than despair, in a new research article published in Cell Systems, Dr. Einav and collaborator Dr. Brian Cleary of the Broad Institute have established new methods to computationally infer missing data values.

Dr. Einav’s research interest is in understanding how evolutionary changes in viral protein sequences affect antibody binding. This work is crucial in creating effective vaccines - it is commonly used to inform development of annual flu vaccines, and has been used by the Bloom lab to identify changes in the SARS-CoV-2 spike protein that promote immune evasion. But once one takes into account the large numbers of viral variants and antibodies (or sera – the blood fluid containing antibodies) to be tested, the number of measurements needed to complete these studies can quickly reach into the millions. “For example, each year the influenza surveillance network surveys ~100,000 viruses against ~10 reference sera to determine whether an antigenically distinct virus strain has emerged,” says Dr. Einav. In such cases, it is often not feasible to collect all desired measurements. Further, such data is often collected across multiple studies using different methodologies, and combining their results into a comprehensive dataset can be challenging.

Rather than relying on the old strategy of “just collect more data,” Drs. Einav and Cleary sought a computational solution to these problems. For this purpose, they turned to a method known as matrix completion. “With matrix completion, the goal is to use a relatively small number of observations…to identify low-rank features that can be used to infer missing values,” they explained. “Biologically, such inferences are possible because antibodies cross-react and often exhibit similar behavior against similar viruses…the low-dimensional nature of these interactions suggests that a serum’s measurement against a few viruses can predict its behavior against many other strains.”
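To make the idea of matrix completion concrete, here is a minimal sketch in Python. It uses iterative truncated-SVD imputation, one simple and generic way to complete a low-rank matrix; it is not necessarily the algorithm used in the paper. The function name `complete_low_rank`, the chosen rank, and the toy serum-by-virus table are all assumptions made for illustration.

```python
import numpy as np

def complete_low_rank(measured, rank=2, n_iter=200):
    """Fill NaN entries of `measured` by iterative truncated-SVD imputation.

    Illustrative sketch only: a generic matrix completion technique,
    not the specific method from the paper.
    """
    observed = ~np.isnan(measured)
    # Start by filling every missing entry with the mean of the observed values.
    filled = np.where(observed, measured, np.nanmean(measured))
    for _ in range(n_iter):
        u, s, vt = np.linalg.svd(filled, full_matrices=False)
        # Keep only the top `rank` components (the "low-rank features").
        low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank]
        # Observed values stay fixed; only the missing entries are updated.
        filled = np.where(observed, measured, low_rank)
    return filled

# Hypothetical toy table: 30 sera x 20 viruses driven by 2 hidden patterns.
rng = np.random.default_rng(0)
true_titers = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 20))

# Hide roughly 30% of the entries to mimic missing measurements.
mask = rng.random(true_titers.shape) < 0.3
measured = true_titers.copy()
measured[mask] = np.nan

completed = complete_low_rank(measured, rank=2)

# Compare against a naive guess (the overall mean) on the hidden entries.
rmse = np.sqrt(np.mean((completed[mask] - true_titers[mask]) ** 2))
baseline_rmse = np.sqrt(np.mean((np.nanmean(measured) - true_titers[mask]) ** 2))
print(f"low-rank RMSE: {rmse:.3f}, mean-fill RMSE: {baseline_rmse:.3f}")
```

Because the toy table really is built from two underlying patterns, a rank-2 fit can recover hidden entries from the measured ones. Real serological data would be noisier, and the appropriate rank would need to be chosen from the data.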

Figure: Matrix completion strategy to infer missing antibody-virus measurements. Image provided by Dr. Tal Einav.

The authors selected multiple large but incomplete datasets for influenza and for HIV on which to test their strategy. They first asked whether matrix completion could accurately predict missing values within a single study. Rather than immediately trying to infer the missing data, they removed additional values (forming a validation set), ran their algorithm, and asked how accurately the validation set was recovered. While the algorithm unsurprisingly performed better the more data it was given, they found that even in cases in which 50% or more of the data had been removed, the algorithm’s error was similar to experimental error. For the flu study, they concluded that performing only 42% of the total measurements would have been sufficient to accurately infer the rest, while for the HIV study that number was as low as 27%.
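The validation strategy described above (hide some known values, impute them, then check the predictions) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the helper names `impute_svd` and `holdout_rmse`, the simple SVD-based imputer, and the synthetic data are all hypothetical.

```python
import numpy as np

def impute_svd(matrix, rank, n_iter=200):
    """Simple iterative truncated-SVD imputer (a stand-in for any completion method)."""
    observed = ~np.isnan(matrix)
    filled = np.where(observed, matrix, np.nanmean(matrix))
    for _ in range(n_iter):
        u, s, vt = np.linalg.svd(filled, full_matrices=False)
        low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank]
        filled = np.where(observed, matrix, low_rank)
    return filled

def holdout_rmse(matrix, withhold_fraction, rank, rng):
    """Hide a random fraction of the observed entries, impute, and score them."""
    observed_idx = np.argwhere(~np.isnan(matrix))
    n_hide = int(round(withhold_fraction * len(observed_idx)))
    picked = rng.choice(len(observed_idx), size=n_hide, replace=False)
    rows, cols = observed_idx[picked, 0], observed_idx[picked, 1]

    # The validation set is removed from the data the imputer gets to see.
    training = matrix.copy()
    training[rows, cols] = np.nan

    predicted = impute_svd(training, rank)
    return np.sqrt(np.mean((predicted[rows, cols] - matrix[rows, cols]) ** 2))

# Synthetic, already-incomplete dataset: low-rank signal plus measurement noise.
rng = np.random.default_rng(1)
data = rng.normal(size=(40, 2)) @ rng.normal(size=(2, 25))
data += 0.1 * rng.normal(size=data.shape)
data[rng.random(data.shape) < 0.1] = np.nan  # values missing from the start

# Withhold increasing fractions of the known values and record the error.
errors = {f: holdout_rmse(data, f, rank=2, rng=rng) for f in (0.2, 0.5)}
for fraction, err in errors.items():
    print(f"withheld {fraction:.0%} of known values -> validation RMSE {err:.3f}")
```

Comparing the validation error at each withheld fraction against the known experimental error is what allows a statement like "only this share of the measurements was needed."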

Next, the authors merged the influenza datasets of several studies that used similar experimental methods to ask whether missing data could be inferred across studies, which is a daunting feat given the many issues with reproducibility. Surprisingly, they found that such cross-study inferences were also quite accurate. Matrix completion was less accurate when combining the disparate HIV datasets, which used different experimental methods. But even in this case, they noted, “the imputed data may contain useful information. For example, these predictions could differentiate between the weakest and strongest of the missing [antibody-virus] interactions.”

Aiming to determine how far they could push the limits of their method, the authors next asked whether existing data could be used to predict interactions of antibodies with new viral variants that emerge in the future. For this analysis, they used HIV data collected prior to 2017 to predict data collected after 2018. Again, this method could robustly predict the strongest antibody-virus interactions; only 20% of measurements were required to identify 80% of the strongest interactions in the post-2018 data.
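One way to read a claim like "a fraction of the measurements identifies most of the strongest interactions" is as a ranking exercise: rank the unmeasured pairs by predicted strength, measure only the top of that list, and count how many of the truly strongest pairs were captured. The sketch below illustrates that reading. The function name `recall_of_strongest`, its parameters, and the numbers are hypothetical, and the paper's exact metric may differ.

```python
import numpy as np

def recall_of_strongest(predicted, actual, top_fraction, budget_fraction):
    """Fraction of the truly strongest pairs found within a measurement budget.

    predicted, actual: 1-D arrays of interaction strengths for the same pairs.
    top_fraction:      share of pairs counted as "strongest" (by actual value).
    budget_fraction:   share of pairs we can afford to measure, chosen by
                       highest predicted value.
    """
    n = len(actual)
    n_top = max(1, int(round(top_fraction * n)))
    n_budget = max(1, int(round(budget_fraction * n)))

    strongest = np.argsort(-actual)[:n_top]      # truly strongest pairs
    selected = np.argsort(-predicted)[:n_budget]  # pairs we would measure
    return np.isin(strongest, selected).sum() / n_top

# Made-up strengths for ten antibody-virus pairs (illustration only).
actual = np.array([9.0, 1.0, 8.0, 2.0, 7.0, 3.0, 6.0, 0.5, 4.0, 5.0])
predicted = np.array([8.5, 1.5, 6.0, 2.5, 7.5, 2.0, 7.0, 1.0, 3.0, 4.5])

recall_30 = recall_of_strongest(predicted, actual, top_fraction=0.2, budget_fraction=0.3)
recall_40 = recall_of_strongest(predicted, actual, top_fraction=0.2, budget_fraction=0.4)
print(recall_30, recall_40)  # 0.5 1.0
```

In this toy case, measuring the top 30% of predictions finds one of the two strongest pairs, and widening the budget to 40% finds both.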

Finally, the authors stepped out of testing mode and applied matrix completion to computationally expand a recent dataset. For this work, they applied data from their model influenza study, which was published in 2014, to a new influenza study from 2021, to predict how sera from the later study would interact with many new viruses. There is hope that this predictive data will inform small-scale, targeted experimental testing to identify antibodies with potency against a greatly expanded set of viruses.

While this will be anathema to many scientists, missing data may in fact be a good thing, says Dr. Einav. “Nearly every dataset is plagued by missing values (e.g. a reagent may run out or an experiment goes awry), which can make it difficult to analyze data. But given an algorithm that predicts these missing values, we can flip this perspective and think of missing values as an asset. For time- and resource-intensive experiments, you don't want to carry out all measurements if you don't have to. And what we see time and again is that you don't need to - datasets often have a few simple underlying patterns, and once we learn those patterns we only need a fraction of the values to predict the entire dataset.” Discouraged scientists everywhere, rejoice.

This work was supported by the Damon Runyon Cancer Research Foundation and the Merkin Institute.

Einav T, Cleary B. Extrapolating missing antibody-virus measurements across serological studies. Cell Syst. 2022 Jul 20;13(7):561-573.e5. doi: 10.1016/j.cels.2022.06.001. Epub 2022 Jul 6. PMID: 35798005.