The huge datasets generated by today’s bioinformatics methods are useless unless they undergo robust quality control. Researchers need to know that they can trust their underlying data before drawing meaningful conclusions. This is especially important in microbiome research, where studies measure fluctuations in a multitude of bacterial species across hundreds to thousands of human subjects. The way these data are collected, handled, and processed can introduce what are known as “batch effects”: systematic variation between “batches” of data that can lead to false positives (and thus incorrect conclusions) and undermine the rigor of statistical models. Dr. Wodan Ling, a postdoctoral researcher in Dr. Michael Wu’s laboratory in the Public Health Sciences Division, recently published a new method in Nature Communications, called ConQuR (Conditional Quantile Regression), that aims to remove batch effects from large datasets.
“The central objective of ConQuR is to remove batch effects while preserving real signals in associations,” wrote Dr. Ling and her colleagues. For this method, a “batch” is considered a complete experiment, or dataset. “ConQuR works directly on taxonomic read counts and generates corrected read counts that enable all of the usual microbiome analyses (visualization, association analysis, prediction, etc.) with few restrictions,” wrote Dr. Ling. “ConQuR assumes that for each microorganism, samples share the same conditional distribution if they have identical intrinsic characteristics…regardless of in which batch they were processed. This does not mean the samples have identical observed values, but they share the same distribution for that microbe. Then operationally, for each taxon and each sample, ConQuR non-parametrically models the underlying distribution of the observed value, adjusting for key variables and covariates, and removes the batch effects relative to a chosen reference batch.” This way, ConQuR allows researchers to correct for the variation introduced by sample handling, library generation, and other factors that skew data on a sample-wide scale. “It works for both classical batch effects (where samples are processed in different batches) as well as for vertical data integration (where we are pooling samples from different studies, with each study treated as a batch),” wrote Dr. Michael Wu.
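The idea of correcting relative to a reference batch can be illustrated with a much simpler technique than ConQuR itself. The Python sketch below performs plain empirical quantile matching for a single taxon: each sample's count is located within its own batch's distribution, then replaced by the reference batch's value at the same percentile. This is an illustration under strong assumptions, not the authors' implementation. It ignores the key variables and covariates that ConQuR adjusts for, and the function names (`empirical_percentile`, `empirical_quantile`, `quantile_match`) and the toy counts are hypothetical.

```python
import math
from bisect import bisect_right


def empirical_percentile(value, sorted_values):
    """Fraction of sorted_values that are <= value (the empirical CDF)."""
    return bisect_right(sorted_values, value) / len(sorted_values)


def empirical_quantile(p, sorted_values):
    """Smallest value whose empirical CDF is at least p (the inverse CDF)."""
    n = len(sorted_values)
    index = max(0, min(n - 1, math.ceil(p * n) - 1))
    return sorted_values[index]


def quantile_match(counts, batches, reference):
    """Map one taxon's counts onto the reference batch's distribution.

    Simplification: no covariate adjustment, so this is unconditional
    quantile matching rather than the conditional modeling ConQuR uses.
    """
    # Group the observed counts by batch and sort each group.
    by_batch = {}
    for count, batch in zip(counts, batches):
        by_batch.setdefault(batch, []).append(count)
    for values in by_batch.values():
        values.sort()

    reference_values = by_batch[reference]
    corrected = []
    for count, batch in zip(counts, batches):
        if batch == reference:
            # Reference-batch samples are left unchanged.
            corrected.append(count)
        else:
            p = empirical_percentile(count, by_batch[batch])
            corrected.append(empirical_quantile(p, reference_values))
    return corrected


# Toy data (made up): batch B is shifted upward relative to batch A.
counts = [0, 2, 4, 6, 10, 12, 14, 16]
batches = ["A", "A", "A", "A", "B", "B", "B", "B"]
corrected = quantile_match(counts, batches, reference="A")
print(corrected)  # [0, 2, 4, 6, 0, 2, 4, 6]
```

Because the output stays on the scale of the reference batch's read counts, downstream analyses can treat the corrected values like ordinary counts, which mirrors the property the authors describe for ConQuR.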
Previously developed methods, such as “ComBat” and “MMUPHin”, have been successfully applied to other genomic technologies outside the microbiome research space. “One day,” Dr. Wu told me, “I read a preprint from a colleague at Boston University who developed an approach for batch correcting RNAseq data. The ‘aha’ moment came when I forwarded the paper to Wodan (who was a new postdoc in my group) and asked her, ‘Can we do better?’ The problem with the other work was that it was for RNAseq data and made a lot of assumptions about the data that do not hold for microbiome data.” For instance, these methods often assume continuous, normally distributed outcomes, whereas microbiome data are “over-dispersed [extremely variable] and heterogeneous, with complex distributions.” Thus, these methods are not flexible enough for the microbiome analyses that Drs. Ling, Wu, and others need to perform.
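To make “over-dispersed” concrete: for Poisson-distributed counts the variance equals the mean, so a variance-to-mean ratio far above 1 signals over-dispersion. The Python sketch below computes that ratio, along with the fraction of zero counts, for one taxon. The function name `dispersion_summary` and the counts are hypothetical and chosen only for illustration; they are not taken from the study.

```python
from statistics import mean, variance


def dispersion_summary(counts):
    """Summarize how far one taxon's counts depart from a Poisson-like shape."""
    average = mean(counts)
    spread = variance(counts)  # sample variance (n - 1 denominator)
    zero_fraction = sum(1 for c in counts if c == 0) / len(counts)
    return {
        "mean": average,
        "variance": spread,
        # A ratio well above 1 indicates over-dispersion.
        "variance_to_mean": spread / average if average > 0 else float("nan"),
        "zero_fraction": zero_fraction,
    }


# Made-up read counts for one taxon across ten samples:
# mostly zeros, with one very large value.
toy_counts = [0, 0, 0, 0, 1, 3, 0, 250, 0, 12]
summary = dispersion_summary(toy_counts)
print(summary["zero_fraction"])          # 0.6
print(summary["variance_to_mean"] > 1)   # True
```

Counts like these, with many zeros and a few extreme values, are poorly described by a continuous normal distribution, which is the mismatch the authors point to.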
To test their method against current tools in the field, the authors applied multiple analysis techniques, including ConQuR, to data from the HIV Microbiome Re-analysis Consortium (HIVRC), a dataset assembled from many individual studies. “For methods work, it's not enough to have a working solution but rather you need to comprehensively validate the approach and demonstrate that it works,” Dr. Wu told me. “This involved thousands upon thousands of computer simulations under countless potential scenarios. This also involved analyzing many different data sets to demonstrate that the approach worked.” In the case of the HIVRC datasets, the authors found that ConQuR “considerably removed the study variation in the raw count…the means of the 10 studies came almost together, and the dispersions and higher-order features are much more aligned”, meaning that inferences drawn from this dataset will be more statistically rigorous if the data are first cleaned with ConQuR. Notably, predictive models built on ConQuR-corrected data were both more sensitive and more specific than those built with any other method tested when used to predict the HIV status of individuals across these 10 studies.
Like all batch removal procedures, ConQuR still has some limitations. “First, ConQuR requires the batch variable to be known,” wrote Dr. Wu. “In practice, there are many situations where the systematic variability in the data arises from cryptic or unknown sources. Thus, we are working on building a batch correction approach in situations where the source is unknown; this is related to work that Jeff Leek started some time ago. Second, one eventual goal is to bring microbiome data into the clinic by developing signatures for predicting outcomes, e.g., response to immunotherapy or relapse following bone marrow transplant. However, how to incorporate batch correction into construction of predictive and prognostic signatures remains unclear. We are excited to be working in these areas.” The authors showed that the accuracy of ConQuR improves as sample size increases, leading them to speculate that “as microbiome profiling studies continue to increase in size, the performance of ConQuR will only continue to improve.”
This work was funded by the National Heart, Lung, and Blood Institute, the HIV Microbiome Re-analysis Consortium, and the National Institutes of Health.
W Ling, J Lu, N Zhao, A Lulla, AM Plantinga, W Fu, A Zhang, H Liu, H Song, Z Li, J Chen, TW Randolph, WLA Koay, JR White, LJ Launer, AA Fodor, KA Meyer, and MC Wu. 2022. Batch effects removal for microbiome data via conditional quantile regression. Nature Communications. 13:5418. https://doi.org/10.1038/s41467-022-33071-9