Data Science Affinity Group

Biostatistics Program

2017 Data Science Affinity Group meetings will take place on the first and third Tuesdays of each month at noon in M1-A303. We are currently scheduling seminars for the year and are interested in hearing about both big-picture ideas in data science and hands-on, technical details about your particular slice of the field. If you are interested in presenting at one of our seminars, please contact the group’s coordinator, Amy Leonardson.




Current and Upcoming Seminars


February 20, 2018 - No Seminar


March 6, 2018 - No Seminar


March 20, 2018

12:00 p.m.

David Clausen, Facebook

Title to come

Abstract to come


April 3, 2018

12:00 p.m.

Tim Durham, University of Washington

Title to come

Abstract to come


April 17, 2018

12:00 p.m.

Rajib Doogar, University of Washington, Bothell

Title to come

Abstract to come


April 24, 2018

12:00 p.m., M1-A305/307

Dirk Petersen, Fred Hutch

AWS at Fred Hutch

Abstract to come

Past Seminars

February 6, 2018

Andrew Magis, Arivale

Analysis of 'normal' individuals using dense, dynamic, personal data clouds

Abstract: In order to understand the basis of wellness and disease, we and others have pursued a global approach termed ‘systems medicine’. The defining feature of systems medicine is the collection of diverse longitudinal data for ‘healthy’ individuals, to assess both genetic and environmental determinants of health and their interactions. We report the generation and analysis of longitudinal multi-omic data for 108 individuals over the course of a 9-month study called the Pioneer 100 Wellness Project (P100). This study included whole genome sequencing and, at three time points, measurements of 218 clinical laboratory tests, 643 metabolites, 262 proteins, and abundances of 4,616 operational taxonomic units in the gut microbiome. Participants also recorded activity using a wearable device (Fitbit).

Using these data, we generated a multi-omic correlation network and identified communities of related analytes that were associated with physiology and disease. We calculated polygenic scores for 127 traits and diseases using effect estimates from published genome-wide association studies (GWAS). By including these polygenic scores in the multi-omic correlation network, we identified molecular correlates of polygenic disease risk in undiagnosed individuals. For example, the polygenic risk score for inflammatory bowel disease (IBD) was negatively correlated with plasma cystine in this unaffected cohort. Abnormally low levels of cystine have previously been associated with IBD in case-control studies. Our results suggest that lower levels of blood cystine may be more common in individuals with higher genetic risk for IBD, even before the disease manifests.

We have expanded these analyses to a larger cohort of more than 2000 individuals enrolled in a commercial wellness program (Arivale, Seattle, WA) for whom similar longitudinal multi-omic data have been collected. Using this larger cohort, we identified additional associations with cumulative genetic risk for disease in a ‘healthy’ population. These results illustrate how multi-omic longitudinal data will improve understanding of health and disease, especially for the detection of early transition states.


January 16, 2018

Bill Noble, UW Genome Sciences and Computer Science

Machine Learning Applications in Genetics and Genomics

Abstract: The field of machine learning, which aims to develop computer algorithms that improve with experience, holds promise to enable computers to assist humans in the analysis of large, complex data sets. I will provide an overview of machine learning applications for the analysis of genome sequencing data sets, including the annotation of sequence elements and epigenetic, proteomic or metabolomic data. I will discuss considerations and recurrent challenges in the application of supervised, semi-supervised and unsupervised machine learning methods, as well as of generative and discriminative modelling approaches. I will also provide general guidelines to assist in the selection of these machine learning methods and their practical application for the analysis of genetic and genomic data sets.


January 9, 2018

Ramkumar Hariharan, Fred Hutch

Introduction to Genomics

Abstract: In this introductory-level whiteboard talk and discussion, we will explore some of the basic aspects of genomics and related areas. Topics will include the origins and nature of genomics data, basic data storage formats, the scale of these data, functional genomics and other omics data types, and challenges in analysis. We will conclude with a discussion of data security and privacy.