2017 Data Science Affinity Group meetings will take place on the first and third Tuesdays of each month at noon in M1-A303. We are currently scheduling seminars for the year and are interested in hearing both about big-picture ideas in data science and about the hands-on, technical details of your particular slice of it. If you are interested in presenting at one of our seminars, please contact the group’s coordinator, Amy Leonardson (email@example.com).
Links to past seminars:
May 16, 2017
Abstract to come
Abstract: Whole-genome sequencing (WGS) data is being generated at an unprecedented rate. Analysis of WGS data requires a flexible data format to store the different types of DNA variation. The variant call format (VCF) is a general text-based format developed to store variant genotypes and their annotations; however, VCF files are large and data retrieval from them is relatively slow. Here I introduce a new WGS variant data format implemented in the R/Bioconductor package “SeqArray” (https://github.com/zhengxwen/SeqArray). It stores variant calls in an array-oriented manner and provides the same capabilities as VCF, but with multiple high-compression options and data access via high-performance parallel computing. Benchmarks using 1000 Genomes Phase 3 data show file sizes of 14.0 GB (VCF), 12.3 GB (BCF, binary VCF), and 2.6 GB (SeqArray). For allele frequency calculation, the SeqArray implementation is over 5 times faster than PLINK v1.9 with VCF and BCF files, and over 16 times faster than vcftools. Used in conjunction with other R/Bioconductor packages, SeqArray provides users with a flexible, feature-rich, high-performance programming environment for analysis of WGS variant data.
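The allele frequency calculation benchmarked above reduces to a simple pass over a genotype array. As a rough illustration of the idea (in Python with NumPy rather than the R/Bioconductor interface SeqArray actually provides), with a toy genotype matrix that is entirely an assumption:

```python
import numpy as np

# Hypothetical genotype matrix: rows = samples, columns = variants.
# Entries count copies of the alternate allele (0, 1, or 2 for diploids);
# -1 marks a missing call. Values here are made up for illustration.
genotypes = np.array([
    [0, 1, 2,  0],
    [1, 1, 0, -1],
    [2, 0, 1,  0],
])

def alt_allele_frequency(g):
    """Alternate-allele frequency per variant, ignoring missing calls."""
    called = g >= 0
    alt_counts = np.where(called, g, 0).sum(axis=0)
    total_alleles = 2 * called.sum(axis=0)   # diploid: 2 alleles per called sample
    return alt_counts / total_alleles

freqs = alt_allele_frequency(genotypes)
```

An array-oriented store makes this kind of column-wise reduction fast, since whole blocks of the genotype matrix can be decompressed and processed in parallel.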
April 18, 2017
Predictive modeling is fundamental for extracting value from large clinical data sets, or “big clinical data,” advancing clinical research, and improving healthcare. Predictive modeling can facilitate appropriate and timely care by forecasting an individual’s health risk, clinical course, or outcome. Machine learning is a major approach to predictive modeling, using algorithms that improve automatically through experience, but two factors make its use in healthcare challenging. First, before training a model, the user of a machine learning software tool must manually select a machine learning algorithm and set one or more configuration values termed hyper-parameters. The algorithm and hyper-parameter values chosen typically affect the resulting model’s accuracy by over 40%, yet selecting them requires special computing expertise as well as many labor-intensive manual iterations. Second, most machine learning models are complex and give no explanation of their prediction results. Yet explanation is essential for a learning healthcare system.
To automate machine learning model building with big clinical data, we are currently developing a software system that can perform the following tasks in a pipeline automatically:
(a) select effective machine learning algorithms and hyper-parameter values to build predictive models;
(b) explain prediction results to healthcare researchers;
(c) suggest tailored interventions; and
(d) estimate outcomes for various configurations, which is needed for determining a proper strategy to deploy a predictive model in a healthcare system.
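The manual algorithm and hyper-parameter selection described above is often automated with a cross-validated search. As a minimal, generic sketch (scikit-learn’s GridSearchCV on a synthetic data set — not the pipeline the talk describes; the data set, model choice, and parameter grid are all assumptions):

```python
# Generic hyper-parameter grid search with cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for a clinical data set.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Candidate hyper-parameter values a user would otherwise try by hand.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, None],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,                  # 3-fold cross-validation
    scoring="accuracy",
)
search.fit(X, y)
best = search.best_params_   # best combination found on held-out folds
```

A full automated system would also search over the choice of algorithm itself, which is what makes the problem labor-intensive when done manually.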
Oncoscape, a web-based data visualization platform, empowers researchers to discover novel patterns and relationships between clinical and molecular factors. In this presentation, we will provide an overview of Oncoscape’s functionality as well as a deep dive into the technology stack that powers it.
Access to real-world clinical data is key for developing personalized therapies, identifying new trial participants, and understanding disease progression more deeply. While real-world data reflects the true diversity of patient populations and treatment paths, it also reflects the diversity of data collection methodologies and rigor. In this presentation, we will discuss a disease-specific approach to developing a clinical data warehouse that addresses the issues these diverse data collection methodologies present. We also seek insight and feedback from the audience on prioritizing the next phases of development.
Dependencies in multivariate observations are a unique gateway to uncovering relationships among processes. An approach that has proved particularly successful in modeling and visualizing such dependence structures is the use of graphical models. However, whereas graphical models have been formulated for finite count data and Gaussian-type data, many other data types prevalent in the sciences have not been accounted for. For example, it is believed that insights into microbial interactions in human habitats, such as the gut or the oral cavity, can be deduced from analyzing the dependencies in microbial abundance data, a data type that is not amenable to standard classes of graphical models. We present a novel framework that unifies existing classes of graphical models and provides new classes that extend the concept of graphical models to a broad variety of discrete and continuous data, in both low- and high-dimensional settings. Moreover, we present a corresponding set of statistical methods and theoretical guarantees that allow for efficient estimation and inference within the framework.
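For orientation, the standard Gaussian case that the framework above generalizes can be sketched with the graphical lasso: nonzero off-diagonal entries of the estimated precision matrix correspond to edges in the dependence graph. This is a generic illustration with synthetic data (not the talk’s method for count or abundance data); the variable names and data-generating choices are assumptions:

```python
# Sketch of a Gaussian graphical model fit via the graphical lasso.
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
# Three variables: x2 depends on x0, while x1 is independent of both.
x0 = rng.normal(size=500)
x1 = rng.normal(size=500)
x2 = 0.6 * x0 + 0.8 * rng.normal(size=500)
X = np.column_stack([x0, x1, x2])

# The l1 penalty (alpha) shrinks spurious partial correlations toward zero,
# leaving a sparse precision matrix whose support defines the graph.
model = GraphicalLasso(alpha=0.05).fit(X)
precision = model.precision_
```

Here the (x0, x2) entry of the precision matrix should be large in magnitude, while the (x0, x1) entry is shrunk toward zero — recovering the true dependence structure. Extending this recipe beyond Gaussian data is exactly where the standard machinery stops and the presented framework begins.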
The Dialogue on Reverse Engineering Assessment and Methods (DREAM) -- better known as the DREAM Challenges -- is an open-science, collaborative competition framework recognized as a successful model for motivating research teams to solve complex biomedical problems. The DREAM vision is to allow individuals and groups to collaborate openly so that the “wisdom of the crowd” provides the greatest impact on science and human health. DREAM has now successfully run more than 38 Challenges in multiple disease and biological areas, including Alzheimer’s disease, rheumatoid arthritis, amyotrophic lateral sclerosis, olfaction, toxicology, and cancer.
The vision of the Hutch Data Commonwealth (HDC) is “to enable investigators to leverage all possible data in the effort to eliminate disease by driving the development of data infrastructure and data science capabilities through collaborative research and robust engineering.” Members of the HDC, including Matthew Trunnell, Naveen Ashish, and James Ryan, will spend the first part of the hour bringing people up to date on the current state of this newly formed organization: how we are structured, our current initiatives, and how you can get involved.
At that point, we will open the floor to questions and comments from the audience for the HDC leadership team. Although we meet with researchers across campus on a regular basis and we have guidance from the HDC Scientific Steering Committee, this is an opportunity for us to hear from the broader community of Fred Hutch data scientists.
Phylogenetics, the inference of evolutionary trees from molecular sequence data such as DNA, yields valuable evolutionary understanding of many biological systems. Although mathematical foundations and algorithms for phylogenetic inference have been under development for many decades, many questions remain. In this talk I will describe some new mathematical results that counter common assumptions, as well as foundations for new Bayesian phylogenetic inference algorithms.