Data Science Affinity Group

Biostatistics Program



2017 Data Science Affinity Group meetings will take place the first and third Tuesdays of each month at noon in M1-A303. We are currently scheduling seminars for the year and are interested in hearing about both big-picture ideas in data science and hands-on, technical details about your particular slice of data science. If you are interested in presenting at one of our seminars, please contact the group’s coordinator, Amy Leonardson (





Current and Upcoming Seminars


March 20, 2018

12:00 p.m.

David Clausen, Facebook

Data Science and Infrastructure Strategy at Facebook

The Infrastructure Strategy group is responsible for the strategic analysis that supports and enables the continued growth critical to Facebook’s infrastructure organization, while providing independent analysis to achieve the optimal value for most infrastructure costs. To do so, the team analyzes user behavior and infrastructure data (from our compute, network and storage tiers) and makes recommendations that guide when, where, and how we build our infrastructure.

Three topics will be covered: capacity demand forecasting, network congestion detection, and scaling continuous push with machine learning.

Scaling Code Release with Machine Learning

In this talk, we will discuss how we use machine learning to scale code release at Facebook. We use machine learning to identify risky revisions, which are then selected for extra scrutiny. This has helped us reduce test failures in trunk and keep our release schedules on track.

Detecting Traffic Congestion

Network traffic congestion is a problem that affects our customer engagement experience. In this talk, we will see how we can use regression models to predict and detect traffic congestion before it becomes a major issue.

Capacity Demand Modeling and Forecasting

How much infrastructure will we need in order to connect the whole world? How many data centers will we have to build, and when? These are important and difficult questions that we need to answer in order to enable Facebook’s growth and mission: to give people the power to build community and bring the whole world closer together. Predicting what users’ demand will look like in the long term is a hard problem. This talk will explain some of the metrics we use to tackle this problem and how our team approaches forecasting.



March 27, 2018

12:00 p.m.

Jeremy Goecks, OHSU

Title to come

Abstract to come


April 3, 2018

2:00 p.m., M1-A305

Don Freed, Sentieon

Title to come

Abstract to come



April 17, 2018

12:00 p.m.

Rajib Doogar, University of Washington, Bothell

Title to come

Abstract to come


April 24, 2018

12:00 p.m., M1-A305/307

Dirk Petersen, Fred Hutch

AWS at Fred Hutch

Abstract to come


May 1, 2018

12:00 p.m.

Tim Durham, University of Washington

Title to come

Abstract to come


May 15, 2018

12:00 p.m.

Raphael Gottardo, Fred Hutch

Title to come

Abstract to come


June 5, 2018

12:00 p.m.

Craig Margaret, Fred Hutch

Title to come

Abstract to come

Past Seminars

February 6, 2018

Andrew Magis, Arivale

Analysis of 'normal' individuals using dense, dynamic, personal data clouds

Abstract: In order to understand the basis of wellness and disease, we and others have pursued a global approach termed ‘systems medicine’. The defining feature of systems medicine is the collection of diverse longitudinal data for ‘healthy’ individuals, to assess both genetic and environmental determinants of health and their interactions. We report the generation and analysis of longitudinal multi-omic data for 108 individuals over the course of a 9-month study called the Pioneer 100 Wellness Project (P100). This study included whole genome sequencing and, at three time points, measurements of 218 clinical laboratory tests, 643 metabolites, 262 proteins, and abundances of 4,616 operational taxonomic units in the gut microbiome. Participants also recorded activity using a wearable device (Fitbit).

Using these data, we generated a multi-omic correlation network and identified communities of related analytes that were associated with physiology and disease. We calculated polygenic scores for 127 traits and diseases using effect estimates from published genome-wide association studies (GWAS). By including these polygenic scores in the multi-omic correlation network, we identified molecular correlates of polygenic disease risk in undiagnosed individuals. For example, the polygenic risk score for inflammatory bowel disease (IBD) was negatively correlated with plasma cystine in this unaffected cohort. Abnormally low levels of cystine have previously been associated with IBD in case-control studies. Our results suggest that lower levels of blood cystine may be more common in individuals with higher genetic risk for IBD, even before the disease manifests.

We have expanded these analyses to a larger cohort of more than 2000 individuals enrolled in a commercial wellness program (Arivale, Seattle, WA) for whom similar longitudinal multi-omic data have been collected. Using this larger cohort, we identified additional associations with cumulative genetic risk for disease in a ‘healthy’ population. These results illustrate how multi-omic longitudinal data will improve understanding of health and disease, especially for the detection of early transition states.


January 16, 2018

Bill Noble, UW Genome Sciences and Computer Science

Machine Learning Applications in Genetics and Genomics

Abstract: The field of machine learning, which aims to develop computer algorithms that improve with experience, holds promise to enable computers to assist humans in the analysis of large, complex data sets. I will provide an overview of machine learning applications for the analysis of genome sequencing data sets, including the annotation of sequence elements and epigenetic, proteomic or metabolomic data. I will discuss considerations and recurrent challenges in the application of supervised, semi-supervised and unsupervised machine learning methods, as well as of generative and discriminative modelling approaches. I will also provide general guidelines to assist in the selection of these machine learning methods and their practical application for the analysis of genetic and genomic data sets.


January 9, 2018

Ramkumar Hariharan, Fred Hutch

Introduction to Genomics

In this introductory-level whiteboard talk and discussion, we will explore some of the basic aspects of genomics and related areas. Topics will include the origins and nature of genomics data, basic data storage formats, the scale of this data, functional genomics and other omics data types, and challenges in analysis. We will conclude with a discussion of data security and privacy.