Data Science Affinity Group

Biostatistics Program


2017 Data Science Affinity Group meetings will take place on the first and third Tuesdays of each month at noon in M1-A303. We are currently scheduling seminars for the year and are interested in hearing about both the big-picture ideas in data science and the hands-on technical details of your particular slice of data science. If you are interested in presenting at one of our seminars, please contact the group’s co-ordinator, Amy Leonardson.


Current and Upcoming Seminars

August 30, 2017

10:00 a.m., Behnke Suites (M1-A303/305/307)

Meliha Yetisgen, University of Washington

Extracting Semantics from Clinical Text for Secondary Use

Abstract: There is a great amount of information captured in physicians’ comments made during health care. Increasingly, researchers are finding valuable uses for this data by mining and aggregating it in clinical and translational studies, which lead to improved patient care. However, most patient information that describes patient state, diagnostic procedures, and disease progress is represented in free-text form. For meaningful use, one of the challenges is to capture the rich semantics surrounding the medical concepts in partially structured clinical text. In this talk, I will summarize the past and ongoing research in my lab on statistical section segmentation to extract the structure of clinical notes, assertion analysis to capture the semantics surrounding the medical concepts, and clinical event extraction with change-of-state and anaphora resolution to create a clinical timeline of patients. I will present clinical application examples where these more sophisticated semantic representation approaches proved more effective than the baseline bag-of-words approach. Example applications include predicting time-of-onset for critical illness phenotypes in the ICU, calculating liver cancer stages from clinical notes, extracting incidental recommendations from radiology reports, and extracting lifestyle and environmental factors from clinical notes for outcome analysis.


September 5, 2017

12:00 p.m., M1-A303

Jesse Bloom, Fred Hutch

Leveraging New Data Sources to Improve Models of Virus Evolution

Abstract to come


September 19, 2017

12:00 p.m., M1-A303

Bharath Sankaranarayan, Microsoft R Server

Title to come

Abstract to come


October 3, 2017

12:00 p.m., M1-A303

Garnet Anderson, Fred Hutch

Title to come

Abstract to come


Past Seminars

August 15, 2017

Ramkumar Hariharan, Fred Hutch

A tale of two projects: our Google Cloud experience, and cancer survivorship

My presentation will have two disparate parts:

Part I

Image segmentation, feature computation and machine learning on digitalized pathology slides (DPS): our Google Cloud Platform (GCP) experience

Tumor slide assessment, even by expert pathologists, is highly subjective. Since tumor type and grade are key determinants of patient prognosis, developing an accurate and robust model to automate this process will greatly benefit the clinic. Only a handful of studies have attempted to address this in any scientifically rigorous way, and these have either suffered from lack of high-quality training sets, or from feature selection challenges. Here, we leverage a recently described cell nuclear segmentation and feature computation algorithm and use it to derive features from a high quality DPS dataset from the Cancer Genome Anatomy Project. We next evaluated the usefulness of these features in predicting tumor grade using pancreatic carcinoma DPS as a case study. All computations were done on GCP which hosted and managed the big data. In this talk, we will present preliminary results from this study including the challenges we faced with image formats, analysis, and finally describe our GCP experience.

Part II

Story of a breast cancer survivor: from treatment side effects to treading a novel data-driven path to wellness

Advances in early diagnosis and treatment of cancer have resulted in vastly improved survival rates for many forms of the disease, and there are an estimated 15.5 million cancer survivors in the US today. However, Quality of Life issues faced by cancer survivors represent a significant problem; the enormous number of survivors living with cancer and treatment-related side effects calls for a systematic, quantitative approach to assess, analyze, and rationally intervene in such health issues. In this talk, we will meet Laura, a hypothetical 55-year-old woman who is about to be diagnosed with breast cancer. We will follow Laura as she passes through the different phases of her “post-cancer” life, from initial diagnosis and therapy to long-term survivorship. Along the way, we will learn about some of the long-term and late effects of cancer treatment that Laura has to struggle with. Finally, we will examine a novel approach based on dense dynamic data clouds that promises to pave the way to wellness for Laura.


June 15, 2017

David Etzioni, Mayo Clinic

Outcomes Research in the Era of Big Data - Has our Reach Exceeded our Grasp?

Health outcomes data are increasingly used as the basis for quality measurement, pay-for-performance, and outcomes research.  Each element of data has a story behind it, and the extent to which these data are accurate and free of bias is questionable.  How good do the data need to be in order for them to be a valid platform for quantitative inquiry?  This talk will review the sources of outcomes data, and present recent research that characterizes the extent to which these data are a valid representation of clinical phenomena.


June 20, 2017

12:00 p.m., M1-A303

Rishabh Jain, University of Washington

Novel Methods for Modeling, Analysis, and Visualization of Multi-dimensional Data

Abstract: Creative methods for modeling, analyzing, or visualizing a dataset can reveal aspects of the data that we are unable to see otherwise.  My work aims to develop and apply such novel methods across diverse domains. In the first half of my talk, I will describe a neurally inspired model of brain regions devoted to processing visual information.  The model learns through visual experience (modeled with publicly available natural image datasets) to organize biologically realistic visual feature maps (for comparison with physiological data). A new, biologically plausible learning rule coupled with a clever visualization method revealed interesting new details about the cortex.  In the second half of my talk, I will describe the UW Medicine research IT model, which uses restricted-access health datasets to answer translational and clinical research questions.   Despite challenges at many steps, from finding and acquiring to porting and processing data, the diversity of the datasets offers significant new opportunities.  I will show examples of data science questions that are being addressed using this data, with applications for improving public health.


June 6, 2017

Xiuwen Zheng, University of Washington

SeqArray -- A storage-efficient high-performance data format for WGS variant calls

Abstract: Whole-genome sequencing (WGS) data is being generated at an unprecedented rate. Analysis of WGS data requires a flexible data format to store the different types of DNA variation. Variant call format (VCF) is a general text-based format developed to store variant genotypes and their annotations. However, VCF files are large and data retrieval is relatively slow. Here I introduce a new WGS variant data format implemented in the R/Bioconductor package “SeqArray”. It stores variant calls in an array-oriented manner and provides the same capabilities as VCF, but with multiple high compression options and data access using high-performance parallel computing. Benchmarks using 1000 Genomes Phase 3 data show file sizes are 14.0 Gb (VCF), 12.3 Gb (BCF, binary VCF) and 2.6 Gb (SeqArray) respectively. For the allele frequency calculation, the implementation in the SeqArray package is over 5 times faster than PLINK v1.9 with VCF and BCF files, and over 16 times faster than vcftools. When used in conjunction with R/Bioconductor packages, the SeqArray package provides users a flexible, feature-rich, high-performance programming environment for analysis of WGS variant data.


May 16, 2017

Matt Fitzgibbon, Bioinformatics Supervisor, Fred Hutch

Data Generation and Analysis with the Genomics and Bioinformatics Shared Resource

April 18, 2017

Gang Luo, University of Washington

Automating Machine Learning Model Building with Big Clinical Data

Predictive modeling is fundamental for extracting value from large clinical data sets, or “big clinical data,” advancing clinical research, and improving healthcare. Predictive modeling can facilitate appropriate and timely care by forecasting an individual’s health risk, clinical course, or outcome. Machine learning is a major approach to predictive modeling using algorithms improving automatically through experience, but two factors make its use in healthcare challenging. First, before training a model, the user of a machine learning software tool must manually select a machine learning algorithm and set one or more model parameters termed hyper-parameters. The algorithm and hyper-parameter values used typically impact the resulting model’s accuracy by over 40%, but their selection requires special computing expertise as well as many labor-intensive manual iterations. Second, most machine learning models are complex and give no explanation of prediction results. Nevertheless, explanation is essential for a learning healthcare system.

To automate machine learning model building with big clinical data, we are currently developing a software system that can perform the following tasks in a pipeline automatically:

(a) select effective machine learning algorithms and hyper-parameter values to build predictive models;

(b) explain prediction results to healthcare researchers;

(c) suggest tailored interventions; and

(d) estimate outcomes for various configurations, which is needed for determining a proper strategy to deploy a predictive model in a healthcare system.

April 4, 2017

Michael Zager, Fred Hutch

Oncoscape: Web Visualization Architecture

Oncoscape, a web-based data visualization platform, empowers researchers to discover novel patterns and relationships between clinical and molecular factors. In this presentation, we will provide an overview of Oncoscape’s functionality as well as a deep dive into the technology stack that powers it.


March 21, 2017

Prutha Kulkarni, Fred Hutch

Quantity, Integrity, Accessibility, and Control: Challenges and Solutions in Clinical Data Management for Research Use

Access to real-world clinical data is key to developing personalized therapies, finding new trial participants, and understanding disease progression more deeply. While real-world data reflects the true diversity of treatment paths and populations of patients, it also reflects the diversity of data collection methodologies and rigor. In this presentation, we will discuss a disease-specific approach to the development of a clinical data warehouse that addresses the issues that diverse data collection methodologies present. We also seek to gain insight and feedback from the audience on prioritization of next phases for development.

February 21, 2017

Johannes Lederer, University of Washington

A General Framework for Uncovering Dependence Networks

Dependencies in multivariate observations are a unique gateway to uncovering relationships among processes. An approach that has proved particularly successful in modeling and visualizing such dependence structures is the use of graphical models. However, whereas graphical models have been formulated for finite count data and Gaussian-type data, many other data types prevalent in the sciences have not been accounted for. For example, it is believed that insights into microbial interactions in human habitats, such as the gut or the oral cavity, can be deduced from analyzing the dependencies in microbial abundance data, a data type that is not amenable to standard classes of graphical models. We present a novel framework that unifies existing classes of graphical models and provides other classes that extend the concept of graphical models to a broad variety of discrete and continuous data, both in low- and high-dimensional settings. Moreover, we present a corresponding set of statistical methods and theoretical guarantees that allows for efficient estimation and inference in the framework.


February 7, 2017

Justin Guinney, Sage Bionetworks

DREAM Challenges: Crowdsourcing Solutions to Complex Biomedical Problems

The Dialogue on Reverse Engineering Assessment and Methods (DREAM) -- better known as DREAM Challenges -- is an open-science, collaborative competition framework that is recognized as a successful model for motivating research teams to solve complex biomedical problems. The DREAM vision is to allow individuals and groups to collaborate openly so that the “wisdom of the crowd” provides the greatest impact on science and human health. DREAM has now successfully run over 38 Challenges in multiple disease and biological areas, including Alzheimer’s, rheumatoid arthritis, amyotrophic lateral sclerosis, olfaction, toxicology, and cancer.


January 17, 2017

Matthew Trunnell, CIO, Fred Hutch

Naveen Ashish, Principal Data Scientist, Hutch Data Commonwealth
James Ryan, Sr Director of Engineering Operations, Hutch Data Commonwealth
Mija Lee, Manager of Big Data Engineering, Hutch Data Commonwealth
Aubree Hoover, Director of Software & Data Products, Hutch Data Commonwealth

Hutch Data Commonwealth and Data Science: Where we are and where we are going...

The vision of the Hutch Data Commonwealth (HDC) is “to enable investigators to leverage all possible data in the effort to eliminate disease by driving the development of data infrastructure and data science capabilities through collaborative research and robust engineering.”  Members of the HDC, including Matthew Trunnell, Naveen Ashish and James Ryan, will spend the first part of the hour bringing people up to date on the current state of this newly formed organization, how we are structured, our current initiatives and how you can get involved.

At that point, we will open the floor to questions and comments from the audience for the HDC leadership team.  Although we meet with researchers across campus on a regular basis and we have guidance from the HDC Scientific Steering Committee, this is an opportunity for us to hear from the broader community of Fred Hutch data scientists.


January 3, 2017

Erick Matsen, Fred Hutch

Phylogenetics for Modern Data Sets, from Foundation on up

Phylogenetics, the inference of evolutionary trees from molecular sequence data such as DNA, yields valuable evolutionary understanding of many biological systems. Although mathematical foundations and algorithms for phylogenetic inference have been under development for many decades, many questions remain. In this talk I will describe some new mathematical results that counter common assumptions, as well as foundations for new Bayesian phylogenetic inference algorithms.