Biostatistics Program

Data Science Affinity Group

The 2017 Data Science Affinity Group meetings take place on the first and third Tuesdays of each month at noon in M1-A303. We are currently scheduling seminars for the year and are interested in hearing about both big-picture ideas in data science and hands-on technical details about your particular slice of the field. If you are interested in presenting at one of our seminars, please contact the group’s co-ordinator, Amy Leonardson.


Current and Upcoming Seminars

July 4, July 18, August 1 - No Seminar


August 15, 2017

12:00 p.m., M1-A303

Ramkumar Hariharan, Fred Hutch

Title to come

Abstract to come


September 5, 2017

12:00 p.m., M1-A303

Jesse Bloom, Fred Hutch

Leveraging New Data Sources to Improve Models of Virus Evolution

Abstract to come


September 19, 2017

12:00 p.m., M1-A303

Bharath Sankaranarayan, Microsoft R Server

Title to come

Abstract to come


October 3, 2017

12:00 p.m., M1-A303

Garnet Anderson, Fred Hutch

Title to come

Abstract to come


Past Seminars

June 15, 2017

David Etzioni, Mayo Clinic

Outcomes Research in the Era of Big Data - Has our Reach Exceeded our Grasp?

Health outcomes data are increasingly used as the basis for quality measurement, pay-for-performance, and outcomes research.  Each element of data has a story behind it, and the extent to which these data are accurate and free of bias is questionable.  How good do the data need to be in order for them to be a valid platform for quantitative inquiry?  This talk will review the sources of outcomes data, and present recent research that characterizes the extent to which these data are a valid representation of clinical phenomena.


June 20, 2017

12:00 p.m., M1-A303

Rishabh Jain, University of Washington

Novel Methods for Modeling, Analysis, and Visualization of Multi-dimensional Data

Abstract: Creative methods for modeling, analyzing, or visualizing a dataset can reveal aspects of the data that we are unable to see otherwise.  My work aims to develop and apply such novel methods across diverse domains. In the first half of my talk, I will describe a neurally inspired model of brain regions devoted to processing visual information.  The model learns through visual experience (modeled with publicly available natural image datasets) to organize biologically realistic visual feature maps (for comparison with physiological data). A new, biologically plausible learning rule coupled with a clever visualization method revealed interesting new details about the cortex.  In the second half of my talk, I will describe the UW Medicine research IT model, which uses restricted-access health datasets to answer translational and clinical research questions.   Despite challenges at many steps, from finding and acquiring to porting and processing data, the diversity of the datasets offers significant new opportunities.  I will show examples of data science questions that are being addressed using this data, with applications for improving public health.


June 6, 2017

Xiuwen Zheng, University of Washington

SeqArray -- A storage-efficient high-performance data format for WGS variant calls

Abstract: Whole-genome sequencing (WGS) data is being generated at an unprecedented rate. Analysis of WGS data requires a flexible data format to store the different types of DNA variation. Variant call format (VCF) is a general text-based format developed to store variant genotypes and their annotations. However, VCF files are large and data retrieval is relatively slow. Here I introduce a new WGS variant data format implemented in the R/Bioconductor package “SeqArray”. It stores variant calls in an array-oriented manner and provides the same capabilities as VCF, but with multiple high compression options and data access using high-performance parallel computing. Benchmarks using 1000 Genomes Phase 3 data show file sizes of 14.0 Gb (VCF), 12.3 Gb (BCF, binary VCF) and 2.6 Gb (SeqArray). For the allele frequency calculation, the implementation in the SeqArray package is over 5 times faster than PLINK v1.9 with VCF and BCF files, and over 16 times faster than vcftools. When used in conjunction with R/Bioconductor packages, the SeqArray package provides users with a flexible, feature-rich, high-performance programming environment for analysis of WGS variant data.
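For readers unfamiliar with the allele frequency calculation used as the benchmark above, the following is a minimal sketch in plain Python. It is not SeqArray's API or storage format; the nested-list layout, the function name alt_allele_frequency, and the example genotypes are all assumptions made for illustration only.

```python
from typing import List, Optional, Tuple

# Hypothetical layout (not SeqArray's): one row per variant, one entry per
# sample, each entry a diploid genotype. 0 = reference allele, 1 = alternate
# allele, None = missing call.
Genotype = Tuple[Optional[int], Optional[int]]


def alt_allele_frequency(variant: List[Genotype]) -> Optional[float]:
    """Return the alternate-allele frequency among called alleles.

    Returns None when every allele at the variant is missing.
    """
    alt_count = 0
    called = 0
    for genotype in variant:
        for allele in genotype:
            if allele is None:
                continue  # skip missing calls
            called += 1
            if allele != 0:
                alt_count += 1
    return alt_count / called if called else None


# Made-up example data: three variants across three samples.
genotypes: List[List[Genotype]] = [
    [(0, 0), (0, 1), (1, 1)],
    [(0, 0), (0, 0), (0, 1)],
    [(None, None), (1, 1), (0, 1)],
]

frequencies = [alt_allele_frequency(v) for v in genotypes]
```

Because each variant is an independent row, the per-variant loop is the kind of work that can be split across parallel workers, which is the general idea behind array-oriented access.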


May 16, 2017

Matt Fitzgibbon, Bioinformatics Supervisor, Fred Hutch

Data Generation and Analysis with the Genomics and Bioinformatics Shared Resource

April 18, 2017

Gang Luo, University of Washington

Automating Machine Learning Model Building with Big Clinical Data

Predictive modeling is fundamental for extracting value from large clinical data sets, or “big clinical data,” advancing clinical research, and improving healthcare. Predictive modeling can facilitate appropriate and timely care by forecasting an individual’s health risk, clinical course, or outcome. Machine learning is a major approach to predictive modeling that uses algorithms which improve automatically through experience, but two factors make its use in healthcare challenging. First, before training a model, the user of a machine learning software tool must manually select a machine learning algorithm and set one or more model parameters termed hyper-parameters. The algorithm and hyper-parameter values used typically impact the resulting model’s accuracy by over 40%, but their selection requires special computing expertise as well as many labor-intensive manual iterations. Second, most machine learning models are complex and give no explanation of prediction results. Nevertheless, explanation is essential for a learning healthcare system.

To automate machine learning model building with big clinical data, we are currently developing a software system that can perform the following tasks in a pipeline automatically:

(a) select effective machine learning algorithms and hyper-parameter values to build predictive models;

(b) explain prediction results to healthcare researchers;

(c) suggest tailored interventions; and

(d) estimate outcomes for various configurations, which is needed for determining a proper strategy to deploy a predictive model in a healthcare system.
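Task (a) above can be illustrated with a toy sketch. This is not the speaker's system: it is a plain exhaustive search over two hypothetical candidate algorithms and a handful of hyper-parameter values, scored on a held-out validation set. The data, the candidate algorithms, and every name below (knn_predict, threshold_predict, SEARCH_SPACE, select_model) are assumptions made for illustration.

```python
from typing import Any, Callable, Dict, List, Tuple

Sample = Tuple[float, int]  # (single feature value, binary label)


def knn_predict(train: List[Sample], x: float, k: int) -> int:
    """Majority vote among the k training samples closest to x."""
    nearest = sorted(train, key=lambda s: abs(s[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0


def threshold_predict(train: List[Sample], x: float, cutoff: float) -> int:
    """Predict 1 when x reaches the cutoff; train is unused but kept so
    both candidate algorithms share one call signature."""
    return 1 if x >= cutoff else 0


def accuracy(predict: Callable[..., int], train: List[Sample],
             valid: List[Sample], **params: Any) -> float:
    """Fraction of validation samples predicted correctly."""
    correct = sum(predict(train, x, **params) == y for x, y in valid)
    return correct / len(valid)


# Hypothetical search space: algorithm name -> (function, hyper-parameter grid).
SEARCH_SPACE: Dict[str, Tuple[Callable[..., int], List[Dict[str, Any]]]] = {
    "knn": (knn_predict, [{"k": 1}, {"k": 3}, {"k": 5}]),
    "threshold": (threshold_predict, [{"cutoff": c} for c in (2.0, 5.0, 8.0)]),
}


def select_model(train: List[Sample],
                 valid: List[Sample]) -> Tuple[str, Dict[str, Any], float]:
    """Try every algorithm and hyper-parameter combination; keep the best."""
    best: Tuple[str, Dict[str, Any], float] = ("", {}, -1.0)
    for name, (predict, grid) in SEARCH_SPACE.items():
        for params in grid:
            score = accuracy(predict, train, valid, **params)
            if score > best[2]:
                best = (name, params, score)
    return best


# Made-up data: label is 1 for larger feature values.
train = [(1.0, 0), (2.0, 0), (3.0, 0), (4.0, 0),
         (6.0, 1), (7.0, 1), (8.0, 1), (9.0, 1)]
valid = [(0.5, 0), (3.5, 0), (4.5, 0), (5.5, 1), (6.5, 1), (9.5, 1)]

best_name, best_params, best_score = select_model(train, valid)
```

Exhaustive search like this becomes impractical as the number of algorithms and hyper-parameters grows, which is part of what motivates automating the selection more intelligently.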

April 4, 2017

Michael Zager, Fred Hutch

Oncoscape: Web Visualization Architecture

Oncoscape, a web-based data visualization platform, empowers researchers to discover novel patterns and relationships between clinical and molecular factors.  In this presentation, we will provide an overview of Oncoscape’s functionality as well as a deep dive into the technology stack that powers it.


March 21, 2017

Prutha Kulkarni, Fred Hutch

Quantity, Integrity, Accessibility, and Control: Challenges and Solutions in Clinical Data Management for Research Use

Access to real-world clinical data is key to developing personalized therapies, identifying new trial participants, and deepening our understanding of disease progression. While real-world data reflects the true diversity of treatment paths and patient populations, it also reflects the diversity of data collection methodologies and rigor. In this presentation, we will discuss a disease-specific approach to the development of a clinical data warehouse that addresses the issues that diverse data collection methodologies present. We also seek to gain insight and feedback from the audience on prioritization of next phases for development.

February 21, 2017

Johannes Lederer, University of Washington

A General Framework for Uncovering Dependence Networks

Dependencies in multivariate observations are a unique gateway to uncovering relationships among processes. An approach that has proved particularly successful in modeling and visualizing such dependence structures is the use of graphical models. However, whereas graphical models have been formulated for finite count data and Gaussian-type data, many other data types prevalent in the sciences have not been accounted for. For example, it is believed that insights into microbial interactions in human habitats, such as the gut or the oral cavity, can be deduced from analyzing the dependencies in microbial abundance data, a data type that is not amenable to standard classes of graphical models. We present a novel framework that unifies existing classes of graphical models and provides other classes that extend the concept of graphical models to a broad variety of discrete and continuous data, both in low- and high-dimensional settings. Moreover, we present a corresponding set of statistical methods and theoretical guarantees that allows for efficient estimation and inference in the framework.


February 7, 2017

Justin Guinney, Sage Bionetworks

DREAM Challenges: Crowdsourcing Solutions to Complex Biomedical Problems

The Dialogue on Reverse Engineering Assessment and Methods (DREAM) -- better known as DREAM Challenges -- is an open-science, collaborative competition framework recognized as a successful model for motivating research teams to solve complex biomedical problems. The DREAM vision is to allow individuals and groups to collaborate openly so that the “wisdom of the crowd” provides the greatest impact on science and human health. DREAM has now successfully run over 38 Challenges in multiple disease and biological areas, including Alzheimer’s, rheumatoid arthritis, amyotrophic lateral sclerosis, olfaction, toxicology, and cancer.


January 17, 2017

Matthew Trunnell, CIO, Fred Hutch

Naveen Ashish, Principal Data Scientist, Hutch Data Commonwealth
James Ryan, Sr Director of Engineering Operations, Hutch Data Commonwealth
Mija Lee, Manager of Big Data Engineering, Hutch Data Commonwealth
Aubree Hoover, Director of Software & Data Products, Hutch Data Commonwealth

Hutch Data Commonwealth and Data Science: Where we are and where we are going. . . .

The vision of the Hutch Data Commonwealth (HDC) is “to enable investigators to leverage all possible data in the effort to eliminate disease by driving the development of data infrastructure and data science capabilities through collaborative research and robust engineering.”  Members of the HDC, including Matthew Trunnell, Naveen Ashish and James Ryan, will spend the first part of the hour bringing people up to date on the current state of this newly formed organization, how we are structured, our current initiatives and how you can get involved.

At that point, we will open the floor to questions and comments from the audience for the HDC leadership team.  Although we meet with researchers across campus on a regular basis and we have guidance from the HDC Scientific Steering Committee, this is an opportunity for us to hear from the broader community of Fred Hutch data scientists.


January 3, 2017

Erick Matsen, Fred Hutch

Phylogenetics for Modern Data Sets, from Foundation on up

Phylogenetics, the inference of evolutionary trees from molecular sequence data such as DNA, yields valuable evolutionary understanding of many biological systems. Although mathematical foundations and algorithms for phylogenetic inference have been under development for many decades, many questions remain. In this talk I will describe some new mathematical results that counter common assumptions, as well as foundations for new Bayesian phylogenetic inference algorithms.