'Wizards' of computational science

Software engineers, many from Microsoft, help ease proteomics-data bottleneck, advance center's early detection, fundamental research
Dr. Martin McIntosh's team includes (clockwise from lower left) software engineers Brendan MacLean, Matthew Bellew and Mark Igra. Photo by Todd McNaught

The white, windowless proteomics laboratories located in the bowels of the Weintraub building hum with activity 24 hours a day, seven days a week, as protein-sorting machines known as mass spectrometers do their work. Churning through hundreds of samples each week, the instruments spit out crucial data for studies that range from the development of blood tests for early diagnosis of cancer to understanding how yeast cells age.

The problem, said Dr. Phil Gafken, staff scientist and manager of the Proteomics Shared Resource, is that there is a lot of data — so much, in fact, that the computers can't keep up or extract all the information that's needed for proteomics research, the field of study that involves analysis of large mixtures of proteins.

"In my facility, we have two mass specs, and if you do the numbers on how quickly the samples are processed and how long it takes for the computers to analyze the data, we can be in situations where we can collect data faster than we can get it analyzed," he said. "For example, things slow down even more for the proteomics group working on the early-detection project, because human samples take longer to be analyzed than yeast samples. It's been a real problem."

A solution arrived last year in the form of a team of software engineers — most of whom are ex-Microsoft employees — that has been aptly described by many of the researchers who now depend on them as "computational wizards." Under the direction of Dr. Martin McIntosh, the six-member group has eased the proteomics-data bottleneck and made new advances that are moving the center's research forward. What's more, the computational methods they are developing could serve as the method of choice for proteomics investigators around the world.

"The volumes and complexity of data we are dealing with — particularly for the work we are doing to develop diagnostic blood tests for cancer — are enormous," said McIntosh, an investigator in the Public Health Sciences Division and principal investigator of a National Cancer Institute-sponsored consortium dedicated to discovering proteins that will make these tests possible. "Every experiment generates enough data to fill up an iPod, but until we have approaches to access the information contained in that data, we cannot make progress."

McIntosh's team has a professional pedigree that would be the envy of any software company; some team members were even profiled in Fortune magazine when they left Microsoft to form their own company, which garnered awards from PC Magazine and other trade journals. The team members are:

  • Jimmy Eng, staff scientist and team leader, is a world-renowned leader in proteomics and former senior software engineer at the Institute for Systems Biology. He has developed the analytic approach that makes much of proteomics possible, including the algorithms at the core of the COMET and SEQUEST programs. Eng continues to receive 20 percent of his support from the Institute for Systems Biology, where he serves as an affiliate member of its computational group.
  • Mark Igra, staff scientist, was one of two original developers of EndNote and a program manager for development of Microsoft Excel, and is now leading a national protein-database development effort.
  • Matthew Bellew, staff scientist, was a lead developer of Microsoft Access and SQL Server and is now an emerging national leader in mass-spectrometry signal processing.
  • Adam Rauch, scientific software engineer, was a lead developer of Microsoft's Visual Basic and is leading development of a data-mining platform that is already being adopted by several proteomics laboratories internationally.
  • Brendan MacLean, scientific software engineer, created several of Microsoft's Web-backed database platforms and is now part of an international group of quantitative scientists and software engineers developing open-source protein sequence algorithms.
  • Ted Holzman, systems analyst/programmer, has years of experience developing applications for genomics analysis and is now leading the integration of many genomics approaches into proteomics.

How do creators of commercially successful office software become experts on proteomics? "Much of proteomics is a complex data-management problem, and that is something that we have a lot of experience with," Bellew said. "Although the scale of the problem is new to science, the scale of the data problems happens every day in business, and so it does not intimidate us."

That's not to say that the team members haven't had to learn some biology on the fly; only Eng and Holzman had previously developed bioinformatics-software methods. They've worked closely with Gafken, Dr. Heidi Zhang, proteomics staff scientist for the early-detection project, and other individual investigators in order to understand the key scientific goals of the mass-spectrometry experiments.

"Their ability to manage projects is phenomenal," Gafken said. "They can take a huge problem, slice it into understandable bits, and solve it, often giving us new ideas about what types of questions we can ask of our data; it is a huge resource for proteomics at the center."

Mass spectrometry is used to analyze complex mixtures of proteins, such as blood serum. A tube of serum contains perhaps hundreds of thousands of different kinds of proteins, which may vary in their relative abundance up to a billion-fold or more. Once a sample enters a mass spectrometer, the proteins are separated according to their size and electric charge. In order to identify proteins of interest, the data from this sorted mixture must be run through complex algorithms and searched against massive databases that store information about all known proteins from a particular organism. It's this step, Gafken said, where the process slows down, and new ideas are needed.

"With the analysis 'pipeline' that the team is creating, data analysis is much faster," Gafken said. "New analysis tools that extract more and more information are being integrated into the system all the time."
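
For readers curious about what "searching against a database" involves, the idea can be illustrated with a deliberately small sketch. It is not the team's pipeline and not how SEQUEST or COMET work internally; those programs score fragment-ion spectra, while this toy version only matches whole-peptide masses. The protein names, sequences and the observed mass below are made up for illustration, and the function names are hypothetical.

```python
import re

# Monoisotopic amino acid residue masses in daltons (standard reference values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added per intact peptide

# Hypothetical two-protein "database"; real ones hold every known
# protein for an organism.
DATABASE = {
    "protein_a": "MKWVTFISLLR",
    "protein_b": "GASPKAVLR",
}


def tryptic_digest(sequence):
    """Cut after K or R unless the next residue is P (the usual trypsin rule)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def search(observed_masses, database, tolerance=0.01):
    """Return, per protein, the peptides whose mass matches an observed mass."""
    hits = {}
    for name, sequence in database.items():
        matched = [
            peptide
            for peptide in tryptic_digest(sequence)
            if any(abs(peptide_mass(peptide) - m) <= tolerance
                   for m in observed_masses)
        ]
        if matched:
            hits[name] = matched
    return hits


# A made-up instrument reading, in daltons.
print(search([457.303], DATABASE))
```

Even in this toy form, every observed mass is compared against every peptide of every protein, which hints at why the step becomes slow once the database holds an organism's full protein set and each experiment yields a very large number of measurements.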

The other significant inventions the team has made are data-viewing tools, which allow individual investigators to use the new algorithms and to manipulate their data through a Web-based interface. This step helps a scientist determine whether a sample contains a protein that might be a good candidate for developing a diagnostic test, and allows researchers from around the globe to share data and work together.

Resource development

Igra and Bellew were the first of the group recruited by McIntosh, who knew Igra socially. "I sent them a job description and asked them to recommend candidates, but I really intended to attract their interest," he said. "They took the bait." Igra and Bellew identified their former colleagues Rauch and MacLean as other excellent candidates.

"The initial plan was for the group to translate into usable software the quantitative approaches developed by other leading proteomics-research laboratories," McIntosh said. "Instead we found they have such talent for applying complex analysis methods that those groups are now adopting our problem-solving tools (algorithms)."

For example, Bellew has in six months developed a set of algorithms for evaluating mass-spectrometry data that appears to outperform methods developed over several years at other leading research institutes.

In addition, the team members are working to create tools to access and compare information that is stored in databases around the world, an achievement that could save investigators hours or days of searching through individual databases at each step of an experiment.

Eng followed this initial group as a key member to lead the team, bringing with him substantial experience with mass spectrometry. "Jimmy is a world leader in proteomics. With him joining the team, the group really took off," McIntosh said.

All of these resources, McIntosh said, are being developed into open-source resources that will be shared freely with other institutions. "The idea is to develop a base platform that can be used by collaborators around the world, who could then also add their own improvements," he said.

That means making sure the analysis algorithms not only work well but are also polished enough that the laboratory researchers, not just the quantitative scientists or programmers, can use them, said Igra.

"In research science, typically when something gets to the point at which you can publish it, it's considered done," he said. "We want to create resources that are useful to many more people than just the investigators who work here, which means taking things to a higher level of completion."

Although a software engineer might use the same skills to manage data points that originate at a cancer-research institute or, say, a department store, the team members clearly appreciate the potential payoff of their work at the center — and the chance to immerse themselves in a new field.

"We spent many years building software for other programmers," Igra said. "We wanted to build something that scientists will use for a good purpose. Plus, we've got the opportunity to learn a ton."
