Big data in cancer research

With the advent of technologies to read the genome, take high-resolution medical images and run large numbers of tests very quickly, big data in health research is becoming a reality. Sharing data even within research teams has been a big challenge, but new cloud-computing infrastructure is on the horizon to help. Graphic illustration by Kim Carney / Fred Hutch News Service

Editor’s note: Read Part 1 of this two-part story, published last week.

Dr. Soheil Meshinchi got notification that the data from the first 10 patients on his team’s new genomic study was ready. He thought he’d just open the file on his office computer and take a look.

His machine refused to open it. The file was just too big.

There were 1,013 patients’ data files to go.

That study, published Monday in the journal Nature Medicine, is the first-ever comprehensive look at the genomic landscape of acute myeloid leukemia, an aggressive blood cancer, in children and young adults. Eight years after the study’s launch, results of its complex data analyses are packed into dense graphics in the paper and its bulging supplemental files. They point toward a consequential conclusion: There are critical biological differences in this cancer that vary depending on the age of the patient, as well as numerous subtypes within those age groups — findings that are changing how the disease is treated in younger patients. And it’s just a start: The team is gathering more in-depth data and expects to publish more detailed analyses in the years to come.

A pediatric AML physician-scientist at Fred Hutchinson Cancer Research Center and a leader of the study, a national initiative called TARGET AML, Meshinchi has watched a revolution occur over the course of his career in the power of technologies for uncovering the secrets of patients’ genomes.

Pediatric AML specialist Dr. Soheil Meshinchi led TARGET AML. Photo by Robert Hood / Fred Hutch

“In my wildest dream, I would not have thought we’d be able to get to where we are,” Meshinchi said. “When we first started this 20 years ago, we were doing gene-by-gene [sequencing]. We can probably generate more data in a week now than we could in those first 15 years.”

The pace of technological innovation is so rapid now that the biggest challenge the team faced was keeping up with it all, said Dr. Hamid Bolouri, who coordinated the work of the team’s six data-analysis experts.

“We were collecting data as the technology was improving and as our understanding of what the technology could tell us was improving,” the computational biologist said from his office at Fred Hutch. “And so you’re constantly having to sit back and rethink: ‘Wait, what is this data telling me?’”

Crashing servers, renting trucks

The study involves scientists at 12 institutions across North America and Europe. This arrangement is increasingly common for big research studies. A recent report from the National Science Foundation found that 60 percent of science and engineering papers published in 2013 were authored by researchers from multiple institutions, many of these cross-border collaborations. This was a marked increase from even just three years prior, when only about 45 percent of papers had multi-institution authorship.

Scientific studies with big teams, like TARGET AML, benefit from the pooling of scientific resources and brainpower, but they also can face a big challenge: how to get everybody’s data together for analysis.

Dr. Hamid Bolouri and colleagues developed new computational methods to work with the study's huge data sets. Photo by Robert Hood / Fred Hutch

“When you finish your analysis, it’s all in an Excel file. It’s easy,” Bolouri said. “But where you start? It’s terabytes and terabytes of data.”

To visualize what that means, let’s do a quick calculation: The TARGET AML team has up to three tissue samples from each of the 1,023 patients in the study so far. (They will complete analysis of samples from another 1,200 patients within the next year.) Collectively, the team’s molecular profiling and other analyses will yield up to three terabytes of data per sample. The final data set will be in the ballpark of 8,000 terabytes, they estimate.

It would take 5,825,416,000 floppy disks to store that much data. Lined up, they’d circle the Earth almost 13 times.
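For readers who want to check the back-of-the-envelope math, here is a minimal sketch in Python. The patient, sample and terabyte figures come from the story; everything else is an assumption chosen for illustration: binary terabytes (1,024^4 bytes), a nominal 1.44-megabyte floppy counted as 1.44 x 1,024^2 bytes, a disk about 90 millimeters wide, and an Earth circumference of about 40,075 kilometers. Under those assumptions the disk count lands close to the figure quoted above, and the line of disks comes out at roughly 13 laps around the planet.

```python
# Figures taken from the story
PATIENTS_SO_FAR = 1_023        # patients in the study so far
MAX_SAMPLES_PER_PATIENT = 3    # "up to three tissue samples"
MAX_TB_PER_SAMPLE = 3          # "up to three terabytes of data per sample"
ESTIMATED_TOTAL_TB = 8_000     # the team's own ballpark estimate

# Assumptions for illustration only (not stated in the story)
BYTES_PER_TB = 1024 ** 4           # binary terabyte
FLOPPY_BYTES = 1.44 * 1024 ** 2    # nominal "1.44 MB" floppy
FLOPPY_WIDTH_M = 0.09              # a 3.5-inch disk is about 90 mm wide
EARTH_CIRCUMFERENCE_KM = 40_075    # approximate equatorial circumference


def upper_bound_tb(patients: int) -> int:
    """Largest possible data volume if every patient hits both 'up to' limits."""
    return patients * MAX_SAMPLES_PER_PATIENT * MAX_TB_PER_SAMPLE


def floppies_needed(total_tb: float) -> float:
    """Number of floppy disks required to hold total_tb terabytes."""
    return total_tb * BYTES_PER_TB / FLOPPY_BYTES


def earth_laps(disk_count: float) -> float:
    """How many times a side-by-side line of disks would circle the Earth."""
    line_length_km = disk_count * FLOPPY_WIDTH_M / 1000
    return line_length_km / EARTH_CIRCUMFERENCE_KM


if __name__ == "__main__":
    disks = floppies_needed(ESTIMATED_TOTAL_TB)
    print(f"Upper bound for current cohort: {upper_bound_tb(PATIENTS_SO_FAR):,} TB")
    print(f"Floppy disks for {ESTIMATED_TOTAL_TB:,} TB: {disks:,.0f}")
    print(f"Laps around the Earth: {earth_laps(disks):.1f}")
```

The upper bound uses the "up to" limits for every patient, so it is a ceiling rather than a forecast; the 8,000-terabyte figure is the team's own estimate, not something this sketch derives.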

The old-school file-transfer technology the team was originally relying on to send their data from one institution to another kept crashing. And when it did work, it could take weeks for a file to get to its destination.

At one point early on, feeling a bit desperate, Meshinchi wondered if it would be easier to personally cart the data from institution to institution in a rented semitrailer. But the cost of the external hard drives alone would be hundreds of thousands of dollars, he calculated.

The types of problems that the team faced have become common, with the advent of technologies to read the genome, take high-resolution medical images and run large numbers of tests very quickly, all of which produce very large data files very fast. More and more health records are becoming digitized, and wearable digital health devices and even social media profiles offer vast new data sets relevant to health research. According to a report from a 2016 workshop held by the European Medicines Agency, the near future will likely bring well over 1,000 terabytes of data per person over their lifetime.

Figuring out new methods for complex data analysis

The TARGET AML researchers eventually got their data into secure cloud storage. They still had to download the huge files locally to analyze them, however. And the data crunching was so complex that the analytical whizzes on the team had to develop new methodologies to “lose the noise and not miss any of the important features,” as Bolouri put it.

But the challenges inspired valuable solutions both for the researchers and for patients. The sheer amount of data, and diverse types of data they gathered, gave their analysis power to reveal unique aspects of the biology of this disease in young patients that hadn’t been seen before. The strength of their findings is already improving the way young AML patients are cared for and kicking off the development of targeted new treatment approaches. And the team’s newly developed methods can now be applied in other studies that use multiple massive sets of data generated by different high-throughput technologies, Bolouri explained.

“Knowing now that these different technologies are complementary and give us different perspectives on the same facts, that really helps us in terms of computational analysis down the line,” he said.

‘An exciting time’

Matthew Trunnell, Fred Hutch's CIO, is working to enable discovery by making research data more accessible. Photo by Robert Hood / Fred Hutch

Such large data sets, and the challenge in sharing and analyzing them in a secure way, “is the new norm for multi-institutional studies,” said Fred Hutch Chief Information Officer Matthew Trunnell.

TARGET AML is just one of many large, publicly funded projects to mine the inner workings of our cells to improve human health, each one generating its own tens to thousands of terabytes of data. What if we could put all this data together, at the fingertips of researchers working to cure deadly diseases like cancer?

The potential in this proposition is widely recognized. One of the recommendations of the Beau Biden National Cancer Moonshot’s Blue Ribbon Panel was for a national-level, cloud-based “ecosystem” for data sharing and analysis. Such a system would serve as “the foundation for developing powerful new integrative analyses, visualization methods, and portals that will not only enable new insights into cancer initiation, progression, and metastasis, but also inform new cancer treatments and help initiate new clinical trials,” the panel wrote in its 2016 report.

The pieces are being put into place now.

Earlier this month, Trunnell attended the kickoff for a new public-private effort, organized by the National Institutes of Health, called the NIH Data Commons Pilot Phase Consortium. The nascent effort brings together the NIH — the world’s largest public funder of biomedical research — data experts, private-sector cloud computing providers and others to create and test ways to keep data and data-analysis tools in the cloud so a researcher anywhere can find data and work with it, securely. Trunnell is on its external panel of consultants.

Individual institutes within the NIH are also beginning to build their own systems for cloud-based data storage and analysis, such as the National Cancer Institute’s development of a Cancer Research Data Commons. And at Fred Hutch, for example, Trunnell is spearheading the development of the data-science hub his team calls the Hutch Data Commonwealth. “All of these have to play together at the end of the day,” Trunnell said. But for now, the fact that these tremendous tools for biomedical data science are being put into place is the most important thing, he said.

“Things are moving now, the pieces are starting to shift in terms of people’s thinking,” he said. “So it’s an exciting time.”

Solving ‘the harder questions’ with new tools

With the partnership of the three big commercial cloud providers, the difficult challenge of making a large-scale, interoperable cloud-based data commons for medical research will be solvable, he said. “And only with that problem solved can we get enough data in the same place to have the statistical power we need to answer some of the harder questions that we’re wrestling with” — like finding more cures for cancer.

There will still be some big challenges in this undertaking, such as protecting the privacy of patient data and coping with unpredictable, rapid increases in the size of biomedical data as technology improves in ways that aren’t easy to foresee. These challenges simply don’t exist to the same degree in the tech world or in high-energy physics, fields that have already largely solved their own big data problems, he said.

But in just four years, Trunnell envisions, a researcher should be able to log on to a cloud-based system from her office computer and “have some visibility” into every data set of any type that was gathered using public funds. The researcher’s team can apply for access to work with the data using cutting-edge data analysis tools that themselves are stored and maintained within the same cloud system. No more massive investments in data storage at every single research institution. No more desperate thoughts about semitrailers just to share data from collaborative projects. Not even any more downloading data to analyze it, a process that can take days for data sets this big even with a screamingly fast internet connection.

By next summer, Trunnell expects the consortium behind the pilot phase of the NIH Data Commons to have its first basic prototype of a system that can do what they’re envisioning. And they’ll go from there.

“If we don’t do it in four years, we’re taking too long,” he said.
