Nostalgic geneticists will tell you that experiments used to take days of manual tinkering, while data analysis - the reward of hard work - might take a scant hour with pencil and paper.
Nowadays, massive experiments blast through black boxes in a few hours, leaving a scientist with enormous computer files to ponder.
With DNA microarrays - the gene chips that allow scientists to read the output of thousands of genes at once - along with robotics and computers, scientists tear through hefty analyses with previously unheard-of swiftness.
But jumbo experiments mean lots more data to sift through - enough to stuff dozens of lab notebooks, said Dr. Jeff Delrow, manager of the Hutch's DNA array laboratory.
"For DNA arrays involving human genes, we're talking about analyzing 18,000 segments of DNA," he said. "That's probably enough computer printout to fill a Seattle telephone book."
New statistical models
The challenge facing biologists now, say many Hutch scientists, is the need for new statistical models to extract useful information - models whose development has not kept pace with the overwhelming number of experiments conducted using microarray technology.
That's why Drs. Lue Ping Zhao and Ross Prentice in the Public Health Sciences Division joined forces with Hutch lab researchers from all four scientific divisions. Their newly developed statistical methods promise to enhance the power of microarray technology to identify sets of genes that cause disease, trigger developmental processes and allow individuals to respond to drugs.
The Hutch has filed a patent application on the methods, developed by Zhao, Prentice and Dr. Linda Breeden of the Basic Sciences Division.
Until now, Zhao said, data analysis for microarrays has been limited to two predominant approaches.
"The first is simply visual inspection of the data," he said. "This is intuitive, and many biologists originally relied on this method. But for analyzing thousands of genes, this becomes both impractical and scientifically subjective."
To overcome this problem, research groups at Stanford University and the Massachusetts Institute of Technology began to use a statistical method called cluster analysis, Zhao said. This method groups genes with similar expression profiles, an indication that they may function in similar pathways.
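As a rough illustration of the grouping idea, and not of the Stanford or MIT implementations, the following Python sketch places genes in the same cluster when their expression profiles correlate strongly with the cluster's first member. The gene names, expression values and the 0.9 threshold are invented for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    denom = math.sqrt(sxx * syy)
    # A flat profile has no defined correlation; treat it as uncorrelated.
    return sxy / denom if denom > 0 else 0.0

def cluster_by_correlation(profiles, threshold=0.9):
    """Greedy grouping: a gene joins the first cluster whose seed (first
    member) it correlates with at or above the threshold; otherwise it
    starts a new cluster."""
    clusters = []
    for gene, profile in profiles.items():
        for cluster in clusters:
            seed = cluster[0]
            if pearson(profile, profiles[seed]) >= threshold:
                cluster.append(gene)
                break
        else:
            clusters.append([gene])
    return clusters

# Hypothetical expression levels measured under four conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 3.9, 6.2, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [8.1, 5.9, 4.2, 2.0],
}
clusters = cluster_by_correlation(profiles)
print(clusters)
```

The sketch uses only the expression values themselves; nothing about the samples, such as who they came from, enters the grouping.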
"But a major limitation of cluster analysis," he said, "is that it doesn't use information external to the microarray data - the kinds of things that are important in solving problems of biological interest."
Relationships of activity
Genetics cannot be separated from a variety of external conditions, Zhao said. He argues that establishing relationships of gene activity with such conditions should be a primary objective of the data analysis - and that statistical programs must allow for this.
"Suppose you have 100 samples from cancer patients that you want to examine to identify genes important for disease development," he said. "People vary in age, gender, what they've eaten, disease diagnosis, history of medication. All of that information has a potential impact on the gene-expression profile. For example, you might want to ask what genetic profile correlates with females who smoke. This algorithm allows you to do that."
Biologists are eager to use microarrays for such studies, Breeden said, but current analytical tools don't provide enough information to be useful for taking the next step in the lab.
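The kind of question Zhao poses, which genes track a patient characteristic such as being a female smoker, can be pictured with an ordinary least-squares fit run gene by gene. The Python sketch below is a generic illustration under that assumption, not the Hutch's patent-pending method; the gene names, the sample values and the single yes/no covariate are hypothetical.

```python
import math

def fit_gene(covariate, expression):
    """Least-squares fit of expression on one covariate.
    Returns (slope, t statistic for the slope)."""
    n = len(covariate)
    mean_x = sum(covariate) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in covariate)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(covariate, expression))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares measures the scatter the fit leaves unexplained.
    rss = sum((y - intercept - slope * x) ** 2
              for x, y in zip(covariate, expression))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    if std_err == 0:
        # Perfect fit: the statistic is unbounded unless the slope is zero.
        return slope, (math.inf if slope != 0 else 0.0)
    return slope, slope / std_err

# Hypothetical covariate: 1 = female smoker, 0 = any other patient.
is_female_smoker = [1, 1, 1, 0, 0, 0]
# Hypothetical expression levels for the same six patients.
expression = {
    "gene1": [5.0, 5.2, 4.8, 2.0, 2.1, 1.9],
    "gene2": [3.0, 3.1, 2.9, 3.0, 2.9, 3.1],
}

results = {gene: fit_gene(is_female_smoker, values)
           for gene, values in expression.items()}
# Rank genes by how strongly their expression tracks the covariate.
ranked = sorted(results, key=lambda g: abs(results[g][1]), reverse=True)
print(ranked)
```

With a yes/no covariate the slope is simply the difference in average expression between the two groups; further covariates such as age or medication history would call for a multiple-regression fit rather than this one-variable version.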
"Clustering, the predominant current strategy, only tells you which sets of genes have related expression profiles and what those profiles look like in a general sense," she said. "This is useful for finding major patterns in the data, especially the ones that you aren't expecting. But if you are looking for genes that respond in a particular way, you can look directly for those genes using statistical modeling."
Unlike cluster analysis, statistical modeling requires one to postulate a hypothesis, Zhao said.
"Making hypotheses with thousands of genes is complicated and non-intuitive, but it can be done with one or a few genes before scaling up to a bigger project," he said.
"For example, contemplating research questions with a few genes, one can come up with specific hypotheses and postulate statistical models to address them. Then we can fit models to the data, giving us a rigorous statistical assessment of whether the expression pattern we observed supports the study hypothesis.
"After feeling comfortable with the accuracy of the statistical model, we can scale up to search among thousands of genes to discover those with similar expression patterns."
On a more fundamental level, having additional statistical tools to confirm existing data is critically important, Breeden said.
"We've had to take most of the existing microarray data analyses at face value because there have been no alternative methods to interrogate the data," she said.
"With a second method based on a completely different strategy, we can analyze the data to see where we agree and where we don't. This gives us another way to verify experimental results."
As with gene sequences, public databases house results from many large-scale array experiments that serve as useful comparative information for other researchers.
Breeden, Prentice and Zhao, whose study appeared in the May 8 Proceedings of the National Academy of Sciences, compared their new algorithm to other methods used to analyze microarray experiments designed to identify cell cycle genes in yeast.
Expression of such genes, Breeden said, would be expected to oscillate at a fixed time in every cell cycle.
"This algorithm lets you ask the computer to search for profiles that fit that pattern and ignore fluctuations that are the result of noise in the data," she said. "Even though we identified 81 percent of the genes that were already known to be cell-cycle regulated, we found only about a 65 percent overlap between cell-cycle genes identified by our method and the previous method. This demonstrates the importance of having multiple tools for analyzing data."
Breeden anticipates using this method to identify coordinately regulated cell-cycle genes and, ultimately, to elucidate the mechanisms that control their expression.
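One simple way to picture a search for profiles that rise and fall once per cell cycle is to fit a sine-and-cosine curve of known period to each gene's time course and keep the genes that the curve explains well. The Python sketch below assumes evenly spaced samples covering a whole number of cycles; the 60-minute period, the gene names, the data and the 0.8 cutoff are hypothetical, and the published algorithm may well use a different model.

```python
import math

def periodic_fit(values, times, period):
    """Fit mean + b*cos(2*pi*t/period) + c*sin(2*pi*t/period).
    Returns (amplitude, fraction of variance explained).
    The closed-form coefficients are only valid for evenly spaced samples
    spanning whole cycles, with more than two samples per cycle."""
    n = len(values)
    mean = sum(values) / n
    b = 2 / n * sum((v - mean) * math.cos(2 * math.pi * t / period)
                    for v, t in zip(values, times))
    c = 2 / n * sum((v - mean) * math.sin(2 * math.pi * t / period)
                    for v, t in zip(values, times))
    amplitude = math.hypot(b, c)
    total = sum((v - mean) ** 2 for v in values)
    explained = n / 2 * amplitude ** 2
    r_squared = explained / total if total > 0 else 0.0
    return amplitude, r_squared

PERIOD = 60                      # hypothetical cell-cycle length, minutes
times = list(range(0, 120, 10))  # 12 samples spanning two full cycles

# Hypothetical time courses: one oscillating gene, one that only jitters.
time_courses = {
    "oscillating_gene": [1 + 2 * math.cos(2 * math.pi * t / PERIOD)
                         for t in times],
    "jittery_gene": [1.0, 1.3, 0.8, 1.1, 0.9, 1.2,
                     1.0, 0.7, 1.2, 0.9, 1.1, 0.8],
}

fits = {gene: periodic_fit(values, times, PERIOD)
        for gene, values in time_courses.items()}
# Keep genes whose variation is mostly explained by the periodic curve.
cycling = [gene for gene, (_, r2) in fits.items() if r2 > 0.8]
print(cycling)
```

Fluctuations that do not repeat from one cycle to the next contribute little to the fitted curve, which is how a model of this kind sets noise aside.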
Zhao and PHS investigator Dr. Jeffrey Thomas worked with Dr. Jim Olson of the Clinical Research Division and Dr. Stephen Tapscott of the Human Biology Division to test the algorithm's worthiness on a problem of clinical relevance.
The team compared the gene-expression profiles of samples from individuals with acute myeloid leukemia and acute lymphoblastic leukemia and identified 141 genes that were differentially expressed between the two groups.
As with the cell-cycle analysis, comparing these results with previous studies that used other methods revealed some significant differences. The power of the statistical algorithm was illustrated by the researchers' ability to further subdivide the patients' genetic profiles on the basis of individual genes.
Tapscott applies the method to study regulation of gene expression in normal development of muscle tissue, while Olson uses it to identify genetic profiles of brain-cancer development.
The algorithm becomes much more useful through collaborations, Zhao said, and is likely to open up new opportunities for scientists to ask meaningful biological questions - a prospect that should comfort even the most nostalgic geneticists.
"I think that traditional hypothesis-driven research, with appropriate statistical tools, will remain very useful for biologists in the post-genome era," he said.
Chen collaborates on oral-cancer genetic work
Dr. Chu Chen, an investigator in the Epidemiology Program in PHS, works with Dr. Lue Ping Zhao on a study with Dr. Stephen Schwartz, also of PHS, and with scientists in the Otolaryngology, Oral Medicine and Pathology Departments at the University of Washington.
Their goal is to identify genetic-expression profiles that correlate with the development of oral cancer, a disease with a five-year survival rate of only about 50 percent.
"So far, we have been comparing the expression patterns of normal, preneoplastic and tumor tissue on chips that let us look at 7,000 genes at once," Chen said.
"We're interested in identifying genes that are up- or down-regulated compared to normal and how their expression is regulated during tumor formation.
"Down the road, we would like to learn whether their expression is influenced by environmental factors and would thus be amenable to preventative action.
"We also would like to identify expression profiles that are correlated with clinical outcomes. This knowledge could help guide physicians in the clinical management of patients."