2018 Data Science Affinity Group meetings will take place on the first and third Tuesdays of each month at noon in M1-A305/7. We are currently scheduling seminars for the Fall and are interested in hearing about both big-picture ideas in Data Science and hands-on, technical details about your particular slice of data science. If you are interested in presenting at one of our seminars, please contact the group’s coordinator, Amber Khoury (firstname.lastname@example.org).
Links to past seminars:
Abstract. We developed a method that enables comparing personal genomes in microseconds, even when the genomes were obtained using different sequencing technologies, processed using different pipelines, represented in different data formats and relative to different reference versions. Our method first reduces genomes to 'fingerprints', from which the original genome data cannot be reconstructed; the method thus has significant implications for privacy-preserving genome analytics.
We also created a domain-agnostic method for comparing any semi-structured data (e.g., in JSON format). I will present several applications of this general-purpose method and discuss their implications and future opportunities.
Abstract. External control arms built from real-world databases could potentially support regulatory and coverage decisions, enabling more efficient clinical evidence generation; however, many questions remain regarding the reliability and comparability of outcomes derived from real-world data versus clinical trials. Carrie Bennette will discuss the challenges, opportunities and learnings from attempting to replicate the outcomes observed in the control arms of recent trials in oncology using longitudinal data from a curated electronic medical records database. Carrie will also discuss her experience as a data scientist who transitioned from academia to Flatiron Health, a healthcare technology company working to accelerate cancer research and improve outcomes for cancer patients.
The SciComp team supports Fred Hutch's science teams by providing high-performance computing, scientific software support, cloud computing access and consulting, Unix database services and provisioning of cloud-based No-SQL databases, scientific data management, Linux/Unix/HPC/Storage/Cloud training and consulting, and Linux desktop support.
Cell maps enable causal, mechanistic analysis of -omics data
Mr. Magaret helps lead and administer the operations of the VIDD Bioinformatics core, fulfilling the roles of data analyst, scientific liaison, and personnel/project manager. His primary problem space is vaccine evaluation and development, and his analytic projects involve the analysis and data mining of high-dimensional biological and clinical data (including viral sequences, human genetic data, immunological assay data, and microarrays), via machine learning and statistical methods.
Ben Busby, Ph.D., is the Genomics Outreach Coordinator and Bioinformatics Training Lead at the National Library of Medicine. He has an extensive background in bioinformatics teaching, mentorship, and technical training, including over three years' experience hosting successful data science hackathons at the National Institutes of Health (https://github.com/NCBI-Hackathons). He has further facilitated the establishment of new data science and bioinformatics training courses, seminars, and training events nationwide.
For his talk, Ben will discuss modern tools for the practice of data science in biomedicine. He will cover a full range of fundamentals, including common database structures and APIs, data extraction methods for popular repositories, high-throughput methods for data analysis, downstream pairing of datasets with general frameworks and systems such as Galaxy and Bioconductor, and options for educators to coordinate and systematize training in the form of seminars and hackathons. Ben will contextualize this overview of data science topics with insights from his extensive teaching and outreach experience, for educators and trainees alike.
The Encyclopedia of DNA Elements (ENCODE) and the Roadmap Epigenomics Project seek to characterize the epigenome in diverse cell types using assays that identify, for example, genomic regions with modified histones or accessible chromatin. These efforts have produced thousands of datasets but cannot possibly measure each epigenomic factor in all cell types. To address this, we present a method, PaRallel Epigenomics Data Imputation with Cloud-based Tensor Decomposition (PREDICTD), to computationally impute missing experiments. PREDICTD leverages an elegant model called “tensor decomposition” to impute many experiments simultaneously. Tensor decomposition learns a low-rank representation of the epigenome that captures latent patterns in ChIP-seq and DNase-seq experiments from the Roadmap Epigenomics data corpus. Compared with the current state-of-the-art method, ChromImpute, PREDICTD produces lower overall mean squared error, and combining the two methods yields further improvement. We show that PREDICTD data captures enhancer activity at noncoding human-accelerated regions. PREDICTD provides reference imputed data and open-source software for investigating new cell types, and demonstrates the utility of tensor decomposition and cloud computing, both promising technologies for bioinformatics.
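To make the idea of imputing by tensor decomposition concrete, here is a minimal sketch. It is not the PREDICTD implementation: it fits a generic rank-1 factorization (one factor vector per axis) to a tiny synthetic tensor by plain gradient descent, then predicts one entry that was hidden during fitting. The axis names (cell type, assay, position), the synthetic values, and the function name `fit_rank1` are all assumptions made for illustration.

```python
def fit_rank1(observed, shape, steps=5000, lr=0.01):
    """Fit T[i][j][k] ~ a[i] * b[j] * c[k] to the observed entries only.

    observed: dict mapping (i, j, k) -> measured value
    shape:    (n_cell_types, n_assays, n_positions), hypothetical axes
    Returns the three factor vectors and the mean squared error per step.
    """
    n_cell, n_assay, n_pos = shape
    a = [1.0] * n_cell
    b = [1.0] * n_assay
    c = [1.0] * n_pos
    losses = []
    for _ in range(steps):
        grad_a = [0.0] * n_cell
        grad_b = [0.0] * n_assay
        grad_c = [0.0] * n_pos
        total = 0.0
        for (i, j, k), value in observed.items():
            err = a[i] * b[j] * c[k] - value
            total += err * err
            # Gradient of the summed squared error w.r.t. each factor.
            grad_a[i] += 2 * err * b[j] * c[k]
            grad_b[j] += 2 * err * a[i] * c[k]
            grad_c[k] += 2 * err * a[i] * b[j]
        losses.append(total / len(observed))
        a = [x - lr * g for x, g in zip(a, grad_a)]
        b = [x - lr * g for x, g in zip(b, grad_b)]
        c = [x - lr * g for x, g in zip(c, grad_c)]
    return a, b, c, losses


# Synthetic, exactly rank-1 "signal" tensor (made-up numbers).
true_a = [1.0, 0.5]
true_b = [1.0, 0.5, 1.5]
true_c = [1.0, 1.2]
full = {
    (i, j, k): true_a[i] * true_b[j] * true_c[k]
    for i in range(2) for j in range(3) for k in range(2)
}

# Hide one "experiment" and fit on the rest.
held_out = (1, 2, 1)
observed = {idx: v for idx, v in full.items() if idx != held_out}

a, b, c, losses = fit_rank1(observed, (2, 3, 2))
imputed = a[1] * b[2] * c[1]
print("true:", full[held_out], "imputed:", round(imputed, 3))
```

Real epigenomic data are far larger and noisier than this toy tensor, so a practical method would need higher rank, regularization, and distributed computation; the sketch only shows why a low-rank model lets observed entries inform a missing one.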
Modern (i.e., double entry) financial accounting is the art and science of tracking human activity by constructing stock and flow statements using a non-dissipative (conservation) axiom called the fundamental identity (FI) that governs the evolution of the stock variables within the system. Many accounting measures are forward-looking and therefore inherently noisy, and because of the FI, this noise permeates throughout the system and affects all accounting measurements. Moreover, since the equation of the world is not known, the structure of the data-generating process (DGP) is not known and must often be estimated from scant data. Consequently, to be usable for control purposes, accounting data needs to be interpreted with great care, and accountants need a much more thorough grounding in data science than is currently imparted in the conventional undergraduate or master’s accounting curriculum.
In addition to being a fertile source of difficult estimation problems, accounting also offers interesting data organization challenges. The financial statement of the average US publicly traded firm contains hundreds of pages of technically dense and often highly interconnected financial and non-financial measures and qualitative assertions. Compiling these statements efficiently and correctly requires the deployment of data science methods that consist of more than “800 spreadsheets being shared over the sneakernet within the corporate finance group.” The fundamental challenges here are “addressability” and “augmentability”, i.e., ensuring that every datum used to compile the ultimate report is recallable at will (addressability) and adding further “texture” to an already complex database while ensuring completeness, consistency and accuracy (augmentability).
In my talk I will illustrate these challenges with examples from my current teaching and research and hope to learn about shared interests and state of the art research methods that could help accountants do a better job of addressing these challenges.
Accurate identification of somatic variation is critically important in Oncology where somatic variation may be used to guide treatment decisions. However, accurate identification of somatic variation remains difficult due to its low allele frequency and sequence noise. We developed TNscope, a haplotype-based somatic variant caller, to provide highly accurate identification of somatic variation.
Sentieon TNscope provides significant improvements over existing tools. New algorithms for local de novo assembly enable full evaluation of all possible haplotype candidates, resulting in increased sensitivity for low allele frequency variant candidates. All variant candidates are evaluated by rigorous PairHMM-based analysis without down-sampling. These variant candidates are further annotated with novel features for improved variant filtering. Structural variants are identified through a combination of split-read and paired-read methods. Filtering of SNVs and Indels is performed with a pre-trained machine-learning model for increased accuracy. The engineering version of this method was used in the most recent ICGC-TCGA DREAM Mutation Calling challenge, and ranked first on the leaderboard for all three categories: SNV, Indel, and SV.
To evaluate the performance of TNscope, we compared Sentieon TNhaplotyper (a tool that uses the same mathematical models and matches the results of the Broad Institute's MuTect2) with Sentieon TNscope using a post-processing model trained on an in-silico mixture of HG002 and HG001. We evaluated the performance of both tools on a tumor-normal pair of GIAB samples with an in-silico mixture of HG005 and HG004 as tumor and HG004 as normal, with tumor fractions of 0.10, 0.15, and 0.20, and both tumor and normal samples at 100x depth. TNscope significantly outperforms TNhaplotyper. On the three tumor fractions, TNhaplotyper achieves F1-scores of (59.8, 73.0, 77.0) for SNPs and (39.7, 48.6, 52.3) for indels, while TNscope achieves F1-scores of (91.7, 97.4, 98.7) for SNPs and (86.9, 93.2, 95.2) for indels.
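For readers unfamiliar with the metric, the F1-score is the harmonic mean of precision and recall; the figures in the abstract above appear to be reported on a 0 to 100 scale. The sketch below computes it from counts of true-positive, false-positive, and false-negative variant calls. The function name and the counts are hypothetical and are not taken from the TNscope evaluation.

```python
def f1_score(true_pos, false_pos, false_neg):
    """Harmonic mean of precision and recall, in the range 0.0 to 1.0."""
    if true_pos == 0:
        return 0.0  # no correct calls: precision and recall are both zero
    precision = true_pos / (true_pos + false_pos)  # fraction of calls that are real
    recall = true_pos / (true_pos + false_neg)     # fraction of real variants called
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts from comparing a call set against a truth set.
score = f1_score(true_pos=90, false_pos=10, false_neg=10)
print(f"F1 = {100 * score:.1f}")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a caller cannot reach a high F1 by trading away sensitivity for precision or the reverse.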
In this talk, I will discuss Oregon Health and Science University’s computational platform for precision oncology. This platform uses several complementary computational approaches: automated data analysis pipelines, a data management system that aggregates clinical, imaging, and omics datasets, and integrative methods for combining information across multiple assays to identify treatments and understand mechanisms of resistance in cancer. The Galaxy workbench plays a key role in OHSU’s precision oncology platform, making it possible to run, reproduce, and share all data analysis pipelines, from primary univariate analyses to downstream multivariate analyses. Galaxy (http://galaxyproject.org) is a scientific analysis workbench used by thousands of scientists worldwide to analyze genomic, proteomic, imaging, and other large biomedical datasets. Galaxy’s user-friendly, web-based interface makes it possible for anyone, regardless of their informatics expertise, to create, run, and share large-scale robust and reproducible analyses. Based on our experiences with this platform and with Galaxy, we have identified opportunities and challenges in developing computational approaches for precision oncology.
The Infrastructure Strategy group is responsible for the strategic analysis to support and enable the continued growth critical to Facebook’s infrastructure organization, while providing independent analysis to achieve the optimal value for most Infrastructure costs. To do so, the team analyzes user behavior and infrastructure data (from our compute, network and storage tiers) and makes recommendations that guide when, where, and how we build our infrastructure.
Three topics will be covered: capacity demand forecasting, network congestion detection, and scaling continuous push with machine learning.
Scaling Code Release with Machine Learning
In this talk we will discuss how we use Machine Learning to scale code release at Facebook. We use Machine Learning to identify risky revisions which are later picked for extra scrutiny. This has helped us reduce test failures in trunk and keep our release schedules on track.
Detecting Traffic Congestion
Network traffic congestion is a problem that affects our customer engagement experience. In this talk, we will see how we can use regression models to predict and detect traffic congestion before it becomes a major issue.
Capacity Demand Modeling and Forecasting
How much infrastructure will we need in order to connect the whole world? How many data centers will we have to build, and when? These are important and difficult questions that we need to answer in order to enable Facebook’s growth and mission: to give people the power to build community and bring the whole world closer together. Predicting what the users’ demand will look like in the long term is a hard problem. This talk will go through explaining some of the metrics we use to tackle this problem and how we approach forecasting in our team.
Abstract: In order to understand the basis of wellness and disease, we and others have pursued a global approach termed ‘systems medicine’. The defining feature of systems medicine is the collection of diverse longitudinal data for ‘healthy’ individuals, to assess both genetic and environmental determinants of health and their interactions. We report the generation and analysis of longitudinal multi-omic data for 108 individuals over the course of a 9-month study called the Pioneer 100 Wellness Project (P100). This study included whole genome sequencing and, at three time points, measurements of 218 clinical laboratory tests, 643 metabolites, 262 proteins, and abundances of 4,616 operational taxonomic units in the gut microbiome. Participants also recorded activity using a wearable device (Fitbit).
Using these data, we generated a multi-omic correlation network and identified communities of related analytes that were associated with physiology and disease. We calculated polygenic scores for 127 traits and diseases using effect estimates from published genome-wide association studies (GWAS). By including these polygenic scores in the multi-omic correlation network, we identified molecular correlates of polygenic disease risk in undiagnosed individuals. For example, the polygenic risk score for inflammatory bowel disease (IBD) was negatively correlated with plasma cystine in this unaffected cohort. Abnormally low levels of cystine have previously been associated with IBD in case-control studies. Our results suggest that lower levels of blood cystine may be more common in individuals with higher genetic risk for IBD, even before the disease manifests.
We have expanded these analyses to a larger cohort of more than 2000 individuals enrolled in a commercial wellness program (Arivale, Seattle, WA) for whom similar longitudinal multi-omic data have been collected. Using this larger cohort, we identified additional associations with cumulative genetic risk for disease in a ‘healthy’ population. These results illustrate how multi-omic longitudinal data will improve understanding of health and disease, especially for the detection of early transition states.
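As a rough illustration of the polygenic scores mentioned in the abstract above, a common formulation sums each variant's risk-allele dosage (0, 1, or 2 copies) weighted by its GWAS effect estimate. The sketch below assumes that simple additive form; the variant identifiers, effect sizes, and the function name `polygenic_score` are hypothetical, and the study's actual scoring procedure may differ.

```python
def polygenic_score(dosages, effect_sizes):
    """Additive score: sum of (risk-allele dosage * GWAS effect estimate).

    dosages:      dict of variant id -> copies of the risk allele (0, 1, or 2)
    effect_sizes: dict of variant id -> published effect estimate (e.g. log odds)
    Variants missing from the individual's genotypes are skipped.
    """
    return sum(
        dosages[variant] * beta
        for variant, beta in effect_sizes.items()
        if variant in dosages
    )


# Hypothetical effect estimates for one trait and one individual's genotypes.
effects = {"variant_1": 0.1, "variant_2": -0.2, "variant_3": 0.3}
genotypes = {"variant_1": 2, "variant_2": 1, "variant_3": 1}

print(round(polygenic_score(genotypes, effects), 3))
```

Scores computed this way for each participant can then be treated like any other analyte and correlated against measured metabolites, proteins, or clinical labs.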
Abstract: The field of machine learning, which aims to develop computer algorithms that improve with experience, holds promise to enable computers to assist humans in the analysis of large, complex data sets. I will provide an overview of machine learning applications for the analysis of genome sequencing data sets, including the annotation of sequence elements and epigenetic, proteomic or metabolomic data. I will discuss considerations and recurrent challenges in the application of supervised, semi-supervised and unsupervised machine learning methods, as well as of generative and discriminative modelling approaches. I will also provide general guidelines to assist in the selection of these machine learning methods and their practical application for the analysis of genetic and genomic data sets.
In this introductory level whiteboard talk/discussion, we will explore some of the basic aspects of genomics and related areas. Topics will include: origins and nature of genomics data, basic data storage formats, scale of this data, functional genomics and other omics data types, challenges in analysis. We will conclude this talk with a discussion on data security and privacy.