Vaccine and Infectious Disease Division

Novel tools for classifying genome-wide transcriptional output

As biological research has moved toward gathering and interpreting huge data sets from experiments such as genome sequencing, proteomics and clinical trials, the ability to analyze the results in a meaningful and accurate way has become critical and requires specific statistical methodologies. In the last decade, the meaning of the term “transcriptome” has shifted dramatically, from analyzing gene expression of predicted coding regions (CDS) of a genome to identifying all RNA transcripts, whether from predicted CDS or not. While identifying all RNA transcribed in cells at any one time is a significant biological tool, it also increases the complexity and quantity of the data gathered. It is the statisticians’ job to design algorithms that can analyze these large data sets and draw conclusions about the underlying biological processes.

Dr. Raphael Gottardo, VIDD associate member, and colleagues developed a novel algorithm for fitting changepoint models that reduces the computational complexity of previously developed approaches. A changepoint model divides an ordered signal, such as probe intensities along a chromosome, into segments whose statistical properties differ. The algorithm was applied to two previously published biological data sets: gene expression in Saccharomyces cerevisiae (yeast) and in humans. These previous studies (the focus here is the yeast study from David et al. 2006) used tiling microarrays for RNA transcript analysis. David et al. analyzed transcription using Affymetrix tiling arrays with 25-mer oligonucleotides spaced every four base pairs across both strands of the yeast genome, which is approximately 12 million bases (yeast can be grown as haploid eukaryotes). The new algorithm takes a sequential approach in which estimation is done ‘online’: estimates are updated as each new observation arrives, rather than repeating all previous calculations whenever new parameters are estimated. This makes the computational analysis much more efficient, rapid and cost-effective. The results were compared to those from a previously published, simpler changepoint model applied to the same yeast data set. The new model agreed with the simpler model with regard to detection of CDS, but it also detected putative transcripts not found in the previous study. The overall conclusion is that this computational framework is a fast and powerful method for detecting RNA transcripts from genome-wide tiling array experiments, and that it adds to our knowledge of cellular transcriptional patterns, which are far more complex than previously appreciated.
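To give a feel for what ‘online’ estimation means, the Python sketch below processes a signal one observation at a time and does a constant amount of work per observation, never revisiting earlier data. It is a deliberately simplified stand-in (a two-sided CUSUM detector for shifts in the mean), not the model published by Caron, Doucet and Gottardo. The names `OnlineMeanShiftDetector` and `detect_changepoints`, and the `threshold` and `drift` values, are hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class OnlineMeanShiftDetector:
    """Toy online detector for shifts in the mean (two-sided CUSUM).

    Illustrative only: this is NOT the published changepoint model.
    """
    threshold: float = 5.0  # hypothetical alarm level for the cumulative sums
    drift: float = 0.5      # hypothetical slack that absorbs small fluctuations
    n: int = 0              # observations in the current segment
    mean: float = 0.0       # running mean of the current segment
    pos: float = 0.0        # cumulative evidence for an upward shift
    neg: float = 0.0        # cumulative evidence for a downward shift
    index: int = 0          # position of the next observation
    changepoints: List[int] = field(default_factory=list)

    def update(self, x: float) -> bool:
        """Consume one observation in O(1) time; return True if a change is flagged."""
        i = self.index
        self.index += 1
        if self.n == 0:  # first observation starts the first segment
            self.n, self.mean = 1, x
            return False
        dev = x - self.mean
        self.pos = max(0.0, self.pos + dev - self.drift)
        self.neg = max(0.0, self.neg - dev - self.drift)
        if self.pos > self.threshold or self.neg > self.threshold:
            self.changepoints.append(i)
            # Start a new segment at this observation and clear the evidence.
            self.n, self.mean = 1, x
            self.pos = self.neg = 0.0
            return True
        # No change flagged: fold x into the running mean incrementally.
        self.n += 1
        self.mean += dev / self.n
        return False


def detect_changepoints(signal: Iterable[float]) -> List[int]:
    """Run the detector over a signal and return the flagged positions."""
    detector = OnlineMeanShiftDetector()
    for x in signal:
        detector.update(x)
    return detector.changepoints


if __name__ == "__main__":
    # Synthetic "probe intensities": low, then high, then low again.
    signal = [0.0] * 20 + [3.0] * 20 + [0.0] * 20
    print(detect_changepoints(signal))
```

On this synthetic signal, whose true shifts are at positions 20 and 40, the detector flags positions 22 and 42. The short delay is the price of waiting for enough evidence before declaring a change, and the cost per observation stays constant however long the signal becomes.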

Caron F, Doucet A and Gottardo R. On-line changepoint detection and parameter estimation with application to genomic data. Stat Comput. 2012 22:579-595.