2017 Data Science Affinity Group meetings will take place on the first and third Tuesdays of each month at noon in M1-A303. We are currently scheduling seminars for the year and are interested in hearing about both big-picture ideas in data science and hands-on technical details about your particular slice of the field. If you are interested in presenting at one of our seminars, please contact the group’s co-ordinator, Amy Leonardson (email@example.com).
Links to past seminars:
The WuXi NextCODE analysis suite provides a standard global format for storing and interrogating large genomic data sets. Powered by the Genomically Ordered Relational database (GORdb), our database architecture gives researchers rapid, interactive access to cancer datasets, including visualization of DNA and RNA sequence reads. The GORdb's Genome Browser allows for rapid visualization of raw BAM files, per-base coverage files, and reference data annotation files, so that clinicians and researchers alike can verify results quickly and confidently. The platform also provides query tools for executing large cohort analyses to identify disease-related variants, identify individuals who carry those variants, and more. In this seminar, scientists from WuXi NextCODE will demonstrate a multi-omic analysis of TCGA data using the WuXi NextCODE platform.
Abstract to come.
Abstract: The field of machine learning, which aims to develop computer algorithms that improve with experience, holds promise to enable computers to assist humans in the analysis of large, complex data sets. I will provide an overview of machine learning applications for the analysis of genome sequencing data sets, including the annotation of sequence elements and epigenetic, proteomic or metabolomic data. I will discuss considerations and recurrent challenges in the application of supervised, semi-supervised and unsupervised machine learning methods, as well as of generative and discriminative modelling approaches. I will also provide general guidelines to assist in the selection of these machine learning methods and their practical application for the analysis of genetic and genomic data sets.
Data science and artificial intelligence are hot topics. Data scientists, whose work has been described as ‘the sexiest job of the 21st century’, not only uncover hidden patterns beneath the facts but also enable future predictions. Medicine, on the other hand, is an old profession that remains full of surprises and unknowns even after so much research has been done. The intersection of data science and medicine naturally attracts attention. This talk will present a real-world healthcare problem and show how a physician applies data science to tackle it.
Over the past sixteen years, our group has brought together multidisciplinary investigators and established large-scale data sharing across academia and industry internationally. This groundbreaking work of integrating a large collection of databases has enabled researchers to answer questions that were otherwise not statistically possible, to design important new clinical studies, to make regulatory observations and to set new standards. Five databases, with individual patient data from over 105,000 patients enrolled on over one hundred clinical trials, are currently established. The data sharing and research activities are governed by the same principles and agreements across all five databases. Over the past eight years, the SHARE team has created a unified infrastructure to facilitate integration and construction of databases as well as verification and quality control. We introduce the established procedures of data sharing, inspection and integration. Furthermore, we highlight the research achievements generated by these databases through key results and publications. The presentation will conclude with new project management initiatives and future directions.
Machine Learning applied to imaging and other disciplines is a current subject of much work, discussion and even hype. Many reports describe training machines to discriminate between image subjects in tasks that we are quite capable of doing ourselves manually, and our own visual acuity is often used as the "gold standard". Machines can be shown to match our own accuracy for many such tasks, and perform them at vastly greater scales than we ever could. But what about going beyond human limits of perception? Can machines be trained to perform well in imaging problems where even subject-matter experts perform poorly? In current medical practice, manual image interpretation is the most common route to diagnosis, but this process is subjective and lacks both accuracy and repeatability. Anecdotally, the reproducibility between different pathologists and radiologists evaluating the same cases can be as low as 70%. In order to use machine learning to improve diagnostic accuracy, we need to train machines to substantially exceed our own abilities. This talk will present several case studies exploring how such systems are trained and evaluated, including an example in cancer diagnostics impacting millions of people.
What can data scientists and our aging study of aging women offer each other? In this talk I will describe efforts to expand the WHI data beyond the original epidemiologic data to incorporate genomic, geographic, mobile health and claims data. These novel data sources create new opportunities to examine disease associations and to validate approaches developed with lower-quality data.
Abstract: The session will give an overview of Microsoft R Server, how it enhances open-source R through the ScaleR functions and machine learning algorithms, and its cross-platform capability. The session will include a few demos covering in-database analytics and how we used DNNs in the health field. Level: Intermediate
Abstract: Viruses such as HIV and influenza evolve rapidly to escape immunity. It has long been possible to use sequencing to monitor how these viruses change at the global scale. However, more recently it has also become possible to view the forces shaping this evolution from other perspectives. I will discuss what we can learn about influenza virus evolution from monitoring how the virus changes within individual infected humans, and by using experiments to comprehensively survey the space of possible mutations. I will also point out how the ability to study evolution from these new perspectives requires new approaches to handle and analyze the data.
Abstract: There is a great amount of information captured in physicians’ comments made during health care. Increasingly, researchers are finding valuable uses by mining and aggregating this data in clinical and translational studies which lead to improved patient care. However, most patient information that describes patient state, diagnostic procedures, and disease progress is represented in free-text form. For meaningful use, one of the challenges is to capture the rich semantics surrounding the medical concepts in partially structured clinical text. In this talk, I will summarize the past and ongoing research in my lab on statistical section segmentation to extract the structure of clinical notes, assertion analysis to capture the semantics surrounding the medical concepts, and clinical event extraction with change-of-state and anaphora resolution to create a clinical timeline of patients. I will present clinical application examples where these more sophisticated semantic representation approaches were proven to be more effective compared to the baseline bag-of-words approach. Example applications include predicting time-of-onset for critical illness phenotypes in the ICU, calculating liver cancer stages from clinical notes, extracting incidental recommendations from radiology reports, and extracting lifestyle and environmental factors from clinical notes for outcome analysis.
My presentation will have two disparate parts:
Image segmentation, feature computation and machine learning on digitalized pathology slides (DPS): our Google Cloud Platform (GCP) experience
Tumor slide assessment, even by expert pathologists, is highly subjective. Since tumor type and grade are key determinants of patient prognosis, developing an accurate and robust model to automate this process will greatly benefit the clinic. Only a handful of studies have attempted to address this in any scientifically rigorous way, and these have suffered either from a lack of high-quality training sets or from feature selection challenges. Here, we leverage a recently described cell nuclear segmentation and feature computation algorithm and use it to derive features from a high-quality DPS dataset from the Cancer Genome Anatomy Project. We then evaluate the usefulness of these features in predicting tumor grade, using pancreatic carcinoma DPS as a case study. All computations were done on GCP, which hosted and managed the big data. In this talk, we will present preliminary results from this study, discuss the challenges we faced with image formats and analysis, and describe our GCP experience.
Story of a breast cancer survivor: from treatment side effects to treading a novel data-driven path to wellness
Advances in early diagnosis and treatment of cancer have resulted in vastly improved survival rates for many forms of the disease, and there are an estimated 15.5 million cancer survivors in the US today. However, quality-of-life issues faced by cancer survivors represent a significant problem; the enormous number of survivors living with cancer and treatment-related side effects calls for a systematic, quantitative approach to assess, analyze, and rationally intervene in such health issues. In this talk, we will meet Laura, a hypothetical 55-year-old woman who is about to be diagnosed with breast cancer. We will follow Laura as she passes through the different phases of her “post-cancer” life, from initial diagnosis and therapy to long-term survivorship. Along the way, we will learn about some of the long-term and late effects of cancer treatment that Laura has to struggle with. Finally, we will examine a novel approach based on dense dynamic data clouds that promises to pave the way to wellness for Laura.
Health outcomes data are increasingly used as the basis for quality measurement, pay-for-performance, and outcomes research. Each element of data has a story behind it, and the extent to which these data are accurate and free of bias is questionable. How good do the data need to be in order for them to be a valid platform for quantitative inquiry? This talk will review the sources of outcomes data, and present recent research that characterizes the extent to which these data are a valid representation of clinical phenomena.
Abstract: Creative methods for modeling, analyzing, or visualizing a dataset can reveal aspects of the data that we are unable to see otherwise. My work aims to develop and apply such novel methods across diverse domains. In the first half of my talk, I will describe a neurally inspired model of brain regions devoted to processing visual information. The model learns through visual experience (modeled with publicly available natural image datasets) to organize biologically realistic visual feature maps (for comparison with physiological data). A new, biologically plausible learning rule coupled with a clever visualization method revealed interesting new details about the cortex. In the second half of my talk, I will describe the UW Medicine research IT model, which uses restricted-access health datasets to answer translational and clinical research questions. Despite challenges at many steps, from finding and acquiring to porting and processing data, the diversity of the datasets offers significant new opportunities. I will show examples of data science questions that are being addressed using this data, with applications for improving public health.
Abstract: Whole-genome sequencing (WGS) data is being generated at an unprecedented rate. Analysis of WGS data requires a flexible data format to store the different types of DNA variation. Variant call format (VCF) is a general text-based format developed to store variant genotypes and their annotations. However, VCF files are large and data retrieval is relatively slow. Here I introduce a new WGS variant data format implemented in the R/Bioconductor package “SeqArray” (https://github.com/zhengxwen/SeqArray). It stores variant calls in an array-oriented manner and provides the same capabilities as VCF, but with multiple high compression options and data access using high-performance parallel computing. Benchmarks using 1000 Genomes Phase 3 data show file sizes are 14.0 Gb (VCF), 12.3 Gb (BCF, binary VCF) and 2.6 Gb (SeqArray) respectively. For the allele frequency calculation, the implementation in the SeqArray package is over 5 times faster than PLINK v1.9 with VCF and BCF files, and over 16 times faster than vcftools. When used in conjunction with R/Bioconductor packages, the SeqArray package provides users a flexible, feature-rich, high-performance programming environment for analysis of WGS variant data.
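As a rough illustration of the allele frequency calculation mentioned in the abstract, the sketch below computes per-variant alternate-allele frequencies from a small sample-by-variant genotype matrix. It is not the SeqArray API or its file format: the function name `alt_allele_frequencies`, the toy data, and the diploid 0/1/2 dosage encoding with `None` for missing calls are all assumptions made for this example.

```python
from typing import List, Optional

# Hypothetical toy encoding: rows are samples, columns are variants.
# Each entry is the number of alternate alleles (0, 1 or 2) carried by a
# diploid sample, or None where the genotype call is missing.
Genotypes = List[List[Optional[int]]]


def alt_allele_frequencies(genotypes: Genotypes) -> List[Optional[float]]:
    """Return the alternate-allele frequency of each variant.

    Missing calls are ignored; a variant with no calls at all yields None.
    """
    if not genotypes:
        return []
    n_variants = len(genotypes[0])
    freqs: List[Optional[float]] = []
    for j in range(n_variants):
        # Gather the non-missing calls for variant j across all samples.
        calls = [row[j] for row in genotypes if row[j] is not None]
        if not calls:
            freqs.append(None)
        else:
            # Each diploid sample contributes two alleles to the denominator.
            freqs.append(sum(calls) / (2 * len(calls)))
    return freqs


toy_genotypes: Genotypes = [
    [0, 1, 2, None],
    [1, 1, 2, None],
    [0, 0, None, None],
]

if __name__ == "__main__":
    print(alt_allele_frequencies(toy_genotypes))
```

Because the calculation works one variant (column) at a time, it divides naturally across parallel workers, which is one reason an array-oriented layout can suit this kind of query.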
Gang Luo, University of Washington
Predictive modeling is fundamental for extracting value from large clinical data sets, or “big clinical data,” advancing clinical research, and improving healthcare. Predictive modeling can facilitate appropriate and timely care by forecasting an individual’s health risk, clinical course, or outcome. Machine learning is a major approach to predictive modeling that uses algorithms which improve automatically through experience, but two factors make its use in healthcare challenging. First, before training a model, the user of a machine learning software tool must manually select a machine learning algorithm and set one or more model parameters termed hyper-parameters. The algorithm and hyper-parameter values used typically impact the resulting model’s accuracy by over 40%, but their selection requires special computing expertise as well as many labor-intensive manual iterations. Second, most machine learning models are complex and give no explanation of prediction results. Nevertheless, explanation is essential for a learning healthcare system.
To automate machine learning model building with big clinical data, we are currently developing a software system that can perform the following tasks in a pipeline automatically:
(a) select effective machine learning algorithms and hyper-parameter values to build predictive models;
(b) explain prediction results to healthcare researchers;
(c) suggest tailored interventions; and
(d) estimate outcomes for various configurations, which is needed for determining a proper strategy to deploy a predictive model in a healthcare system.
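To make task (a) concrete, here is a minimal sketch of automated hyper-parameter selection: it tries several candidate values of k for a toy one-dimensional k-nearest-neighbour classifier and keeps the one with the best leave-one-out accuracy. This is not the system described in the abstract; the classifier, the exhaustive search, the function names and the toy data are all assumptions chosen only to illustrate the idea.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, int]  # (feature value, binary label)


def knn_predict(train: Sequence[Point], x: float, k: int) -> int:
    """Predict a label by majority vote among the k nearest training points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0


def loo_accuracy(data: Sequence[Point], k: int) -> float:
    """Leave-one-out accuracy of the k-NN classifier for a given k."""
    correct = 0
    for i, (x, y) in enumerate(data):
        # Hold out point i and predict it from all the others.
        train = [p for j, p in enumerate(data) if j != i]
        if knn_predict(train, x, k) == y:
            correct += 1
    return correct / len(data)


def select_k(data: Sequence[Point], candidates: List[int]) -> Tuple[int, Dict[int, float]]:
    """Score every candidate k and return the best one with all scores.

    Ties are broken in favour of the candidate listed first.
    """
    scores = {k: loo_accuracy(data, k) for k in candidates}
    best = max(candidates, key=lambda k: scores[k])
    return best, scores


# Hypothetical toy data: two well-separated groups on a line.
toy_data: List[Point] = [(0.0, 0), (1.0, 0), (2.0, 0), (10.0, 1), (11.0, 1), (12.0, 1)]

if __name__ == "__main__":
    best_k, all_scores = select_k(toy_data, [1, 3, 5])
    print(best_k, all_scores)
```

Exhaustive search like this becomes expensive as the number of algorithms and hyper-parameters grows, which is part of why the selection problem described above is hard to automate well.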
Oncoscape, a web-based data visualization platform, empowers researchers to discover novel patterns and relationships between clinical and molecular factors. In this presentation, we will provide an overview of Oncoscape’s functionality as well as a deep dive into the technology stack that powers it.
Access to real-world clinical data is key to developing personalized therapies, identifying new trial participants, and deepening our understanding of disease progression. While real-world data reflects the true diversity of treatment paths and patient populations, it also reflects the diversity of data collection methodologies and rigor. In this presentation, we will discuss a disease-specific approach to developing a clinical data warehouse that addresses the issues these diverse data collection methodologies present. We also seek insight and feedback from the audience on prioritizing the next phases of development.
Dependencies in multivariate observations are a unique gateway to uncovering relationships among processes. An approach that has proved particularly successful in modeling and visualizing such dependence structures is the use of graphical models. However, whereas graphical models have been formulated for finite count data and Gaussian-type data, many other data types prevalent in the sciences have not been accounted for. For example, it is believed that insights into microbial interactions in human habitats, such as the gut or the oral cavity, can be deduced from analyzing the dependencies in microbial abundance data, a data type that is not amenable to standard classes of graphical models. We present a novel framework that unifies existing classes of graphical models and provides other classes that extend the concept of graphical models to a broad variety of discrete and continuous data, both in low- and high-dimensional settings. Moreover, we present a corresponding set of statistical methods and theoretical guarantees that allows for efficient estimation and inference in the framework.
The Dialogue on Reverse Engineering Assessment and Methods (DREAM) -- better known as DREAM Challenges -- is an open science, collaborative competition framework, and recognized as a successful model for motivating research teams to solve complex biomedical problems. The DREAM vision is to allow individuals and groups to collaborate openly so that the “wisdom of the crowd” provides the greatest impact on science and human health. DREAM has now successfully run over 38 Challenges in multiple disease and biological areas, including Alzheimer’s, rheumatoid arthritis, amyotrophic lateral sclerosis, olfaction, toxicology, and cancer.
The vision of the Hutch Data Commonwealth (HDC) is “to enable investigators to leverage all possible data in the effort to eliminate disease by driving the development of data infrastructure and data science capabilities through collaborative research and robust engineering.” Members of the HDC, including Matthew Trunnell, Naveen Ashish and James Ryan, will spend the first part of the hour bringing people up to date on the current state of this newly formed organization, how we are structured, our current initiatives and how you can get involved.
At that point, we will open the floor to questions and comments from the audience for the HDC leadership team. Although we meet with researchers across campus on a regular basis and we have guidance from the HDC Scientific Steering Committee, this is an opportunity for us to hear from the broader community of Fred Hutch data scientists.
Phylogenetics, the inference of evolutionary trees from molecular sequence data such as DNA, yields valuable evolutionary understanding of many biological systems. Although mathematical foundations and algorithms for phylogenetic inference have been under development for many decades, many questions remain. In this talk I will describe some new mathematical results that counter common assumptions, as well as foundations for new Bayesian phylogenetic inference algorithms.