Massive data sets have become commonplace in many applied fields, including genomics, neuroscience, finance, information retrieval, and the social sciences. Practitioners in these fields often want to fit complicated Bayesian models to these data sets in order to capture underlying dependencies in the data. Unfortunately, traditional algorithms for Bayesian inference in these models, such as Markov chain Monte Carlo and variational inference, do not typically scale to the large data sets encountered in practice. Additional complications arise when faced with an unbounded amount of data that arrives as a stream, since most existing inference algorithms are not applicable. In this talk I will discuss our recent work on developing algorithms for these two settings. First, I will describe a stochastic variational inference algorithm for hidden Markov models with provable convergence guarantees. The algorithm allows modeling sequences of hundreds of millions of observations without breaking the chain into arbitrary pieces. We demonstrate the efficacy of the algorithm by using it to segment a human chromatin data set with 250 million observations, achieving performance comparable to a state-of-the-art model in a fraction of the time. Next, I will present a streaming variational inference algorithm for Bayesian nonparametric mixture models, which provides an efficient nonparametric clustering algorithm for massive streaming data. We apply the method to perform online clustering of a large corpus of New York Times documents.
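The streaming nonparametric clustering idea can be illustrated with a much simpler heuristic than the talk's variational algorithm: the one-pass, small-variance DP-means rule, which opens a new cluster whenever a point falls farther than a threshold `lam` from every existing centroid. This sketch is not the speaker's method; the threshold and data below are purely illustrative.

```python
import numpy as np

def dp_means_stream(points, lam):
    """One-pass DP-means-style clustering: open a new cluster when a point
    is farther than lam from every existing centroid (illustrative only)."""
    centroids, counts, labels = [], [], []
    for x in points:
        if centroids:
            d = [np.linalg.norm(x - c) for c in centroids]
            j = int(np.argmin(d))
        if not centroids or d[j] > lam:
            centroids.append(np.array(x, dtype=float))  # start a new cluster
            counts.append(1)
            labels.append(len(centroids) - 1)
        else:
            counts[j] += 1
            centroids[j] += (x - centroids[j]) / counts[j]  # running mean update
            labels.append(j)
    return centroids, labels

# Two well-separated blobs yield two clusters at lam = 3
pts = np.vstack([np.zeros((50, 2)), 10 + np.zeros((50, 2))])
cents, labs = dp_means_stream(pts, lam=3.0)
```

The key property shared with Bayesian nonparametric mixtures is that the number of clusters is not fixed in advance; it grows as the stream demands.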
Mobile technologies have the potential to revolutionize both the way individuals monitor their health and the way researchers collect frequent, yet sparse, data on participants in clinical studies. The impact of the high-resolution activity data collected, however, is only beginning to be explored. In March 2015, Sage Bionetworks launched mPower, an observational smartphone-based study developed using Apple’s ResearchKit library, to evaluate the feasibility of remotely collecting frequent information about daily changes in symptom severity and their sensitivity to medication in Parkinson disease (PD). The study interrogated aspects of this movement disorder through surveys and frequent sensor-based recordings from participants with and without PD. These measurements provide the ability to explore classification of control participants versus those who self-report having PD, as well as to begin to measure the severity of PD for those with the disease. Benefitting from large enrollment and repeated measurements on many individuals, these data may help establish the baseline variability of real-world activity measurements collected via mobile phones, and ultimately may lead to quantification of the ebbs and flows of Parkinson symptoms.
The development of next-generation sequencing (NGS) technologies for HLA and KIR genotyping is rapidly advancing knowledge of genetic variation at these highly polymorphic loci. NGS genotyping is poised to replace older methods for clinical use, but standard methods for reporting and exchanging these new, high-quality genotype data are needed. We have organized a series of Data Standards ‘Hackathon’ events with the Immunogenomic NGS Consortium, a broad collaboration of histocompatibility and immunogenetics clinicians, researchers, instrument manufacturers, and software developers.
One product of this collaboration is the Minimum Information for Reporting Immunogenomic NGS Genotyping (MIRING) guidelines. We have also developed an electronic format for exchanging HLA and KIR genotyping data, with extensions for NGS. This format addresses NGS data exchange by refining the Histoimmunogenetics Markup Language (HML) to conform to the MIRING guidelines. I will review these new developments and discuss their impact.
The long-touted field of gene expression diagnostics is finally hitting its stride, seeing a flurry of development activity and increasing acceptance from regulators, payors and doctors. Along the way, the field has departed from the study designs and training techniques imagined in the biomarker literature. I’ll describe current trends in the field as seen from the vantage point of a prominent gene expression diagnostics company, and I’ll describe training strategies that have led to successful diagnostics. I’ll also highlight methodological needs in the field.
Tumors are typically sequenced to depths of 75-100x (exome) or 30-50x (whole genome). We demonstrate that current sequencing paradigms are inadequate for tumors that are impure, aneuploid, or clonally heterogeneous. To reassess optimal sequencing strategies, we performed ultra-deep (up to ~312x) whole genome sequencing (WGS) and exome capture (up to ~433x) of a primary acute myeloid leukemia, its subsequent relapse, and a matched normal skin sample. We tested multiple alignment and variant calling algorithms and validated ~200,000 putative SNVs by sequencing them to depths of ~1,000x. Additional targeted sequencing provided over 10,000x coverage, and ddPCR assays provided up to ~250,000x sampling of selected sites. We evaluated the effects of different library generation approaches, depth of sequencing, and analysis strategies on the ability to effectively characterize a complex tumor.

Once complete, analysis of a complex patient tumor can result in the discovery of tens, hundreds, or even thousands of potential cancer-driving alterations. However, few resources exist to facilitate prioritization and interpretation of these alterations in a clinical context. Interpreting the events from even a single case currently requires both extensive bioinformatics expertise and an understanding of cancer biology and clinical paradigms. This interpretation step now represents a significant bottleneck, preventing the realization of personalized medicine. To alleviate this bottleneck, we present CIViC (www.civicdb.org), a knowledgebase for the clinical interpretation of variants in cancer. We believe that to succeed, such a resource must be comprehensive, current, community-based, and open-access. CIViC allows curation of structured evidence coupled with free-form discussion for user-friendly interpretation of the clinical actionability of genomic alterations.
Biomarkers that drift differentially with age between normal and premalignant tissues, such as Barrett's esophagus (BE), have the potential to improve the assessment of a patient's cancer risk by providing quantitative information about how long a patient has lived with the precursor (i.e., dwell time). In the case of BE, such biomarkers would be particularly useful because esophageal adenocarcinoma (EAC) risk may change with BE dwell time and it is generally not known how long a patient has lived with BE when a patient is first diagnosed with this condition. In this study we first describe a statistical analysis of DNA methylation data (both cross-sectional and longitudinal) derived from tissue samples from 50 BE patients to identify and validate a set of 67 CpG dinucleotides that undergo age-related methylomic drift. We introduce a Bayesian model that incorporates longitudinal methylomic drift rates, patient age, and methylation data from individually paired BE and normal squamous tissue samples to estimate patient-specific BE onset times. Our application of the model to 30 sporadic BE patients' methylomic profiles first exposes a wide heterogeneity in patient-specific BE onset times. Furthermore, independent application of this method to a cohort of 22 familial BE (FBE) patients reveals significantly earlier mean BE onset times. Our analysis supports the conjecture that differential methylomic drift occurs in BE (relative to normal squamous tissue) and hence allows quantitative estimation of the time that a BE patient has lived with BE.
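The core dating idea can be sketched with a simplified point estimate (the talk's model is fully Bayesian and also incorporates patient age and longitudinal data): if CpG site i drifts at a known rate r_i per year, the paired methylation difference satisfies delta_m_i ≈ r_i · t, and the dwell time t falls out of a least-squares fit across sites. The function name and rates below are hypothetical.

```python
import numpy as np

def estimate_dwell_time(delta_m, rates):
    """Least-squares dwell time t minimizing ||delta_m - rates * t||^2.

    delta_m: per-site methylation difference (BE minus normal squamous)
    rates:   per-site drift rates (methylation units per year)
    """
    rates = np.asarray(rates, dtype=float)
    delta_m = np.asarray(delta_m, dtype=float)
    return float(rates @ delta_m / (rates @ rates))

# Toy check: 20 years of drift at known rates recovers t = 20
rates = np.array([0.010, 0.020, 0.005])
t_hat = estimate_dwell_time(20.0 * rates, rates)
```

In the actual model, uncertainty in the drift rates and in baseline methylation would widen this into a posterior distribution over onset times rather than a single point estimate.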
The large datasets being generated by current and future astronomical surveys give us the ability to answer questions at a breadth and depth that was previously unimaginable. Yet datasets that strive to be generally useful are rarely ideal for any particular science case: measurements are often sparser, noisier, or more heterogeneous than one might hope. Adapting tried-and-true statistical methods to this new milieu of large-scale, noisy, heterogeneous data often requires us to re-examine these methods: to pry off the lid of the black box, consider the assumptions they are built on, and ask how those assumptions can be relaxed for use in this new context. In this talk I’ll explore a case study of such an exercise: our extension of the Lomb-Scargle periodogram for use with the sparse, multi-color photometry expected from LSST. For studies involving RR Lyrae-type variable stars, we expect this multiband algorithm to push the effective depth of LSST two magnitudes deeper than previously used methods.
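As background, the classical single-band Lomb-Scargle periodogram, the building block that the multiband method extends, can be computed with SciPy. This sketch is not the LSST multiband algorithm itself, and the simulated light curve is purely illustrative.

```python
import numpy as np
from scipy.signal import lombscargle

rng = np.random.default_rng(0)

# Irregularly sampled sinusoid with period 2.5 (arbitrary time units) plus noise,
# mimicking the uneven cadence of survey photometry
t = np.sort(rng.uniform(0.0, 100.0, 300))
y = np.sin(2 * np.pi * t / 2.5) + 0.1 * rng.normal(size=t.size)

# Scan a grid of candidate angular frequencies
freqs = np.linspace(0.5, 5.0, 2000)
power = lombscargle(t, y - y.mean(), freqs)

# The peak of the periodogram recovers the true angular frequency 2*pi/2.5
best = freqs[np.argmax(power)]
```

The multiband extension, roughly speaking, shares a model across filter bands so that sparse per-band observations still constrain a common period.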
At Johns Hopkins, as at many other organizations, there is a sudden and urgent need for data science training, outreach, and research. I will discuss how we have tried to tackle these issues with a bottom-up, low-budget approach that maximizes our impact by taking advantage of web resources. I will also discuss how we are working on the next steps of solidifying our presence locally at Johns Hopkins.
Metastatic breast cancer (MBC) remains one of the leading causes of cancer death in the U.S. The genomics of MBC are understudied, in large part because there is limited infrastructure to allow patients treated in the community setting to contribute their tumor samples and clinical information to research. The Metastatic Breast Cancer Project (MBCproject.org) is a new nationwide genomics research initiative that seeks to empower patients through sharing their samples and clinical information, as well as their insights into the design and implementation of the study. In this presentation I will share the story of the development of our outreach program, which connects patients with MBC around the country with genomics research performed at the Broad Institute of MIT and Harvard. We use traditional communications methods as well as social media to enroll patients, and use an iterative feedback process to work effectively with the MBC community to collect aggregate data, which has provided novel insights as well as unanticipated research questions.
Since launching six months ago, we have enrolled over 1900 people with MBC, who have told us about their experiences with cancer through an online questionnaire. Our direct-to-patient approach has allowed for the rapid identification of large numbers of patients with rare phenotypes, which has heretofore been extremely challenging using traditional research methods. Additional cohorts will be added to the MBCproject over time, including young women with MBC and patients with drug-resistant MBC. Ultimately, we seek to establish a broad patient-researcher partnership to accelerate genomic discoveries across multiple cancers that may serve as a means to build a new clinical and translational research infrastructure for patients with cancer.
Abstract: Adjusting for health conditions is ubiquitous in health care. The federal government, as well as health plans and provider organizations, routinely rely on risk adjustment to predict health care spending, and may be using the same formulas to assess the contribution of medical conditions to overall health care spending. Typically, these formulas are estimated with parametric linear regression. The introduction of machine learning approaches has the potential to provide both improved prediction and statistical inference. I will discuss the implementation of ensembling for plan payment risk adjustment, possibly allowing for a simplified formula, thereby reducing incentives for increased coding intensity and the ability of insurers to "game" the system with aggressive diagnostic upcoding. Additionally, I will present an evaluation of how much more, on average, enrollees with each medical condition cost after controlling for demographic information and other medical conditions, using doubly robust techniques with ensembles. Results indicate that the health spending literature may not be capturing the true incremental effect of medical conditions, potentially leaving undesirable incentives related to prevention of disease.
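A minimal sketch of the stacking idea behind such ensembles, with purely synthetic data and not the actual plan payment formula: fit base predictors of spending on one split of the enrollees, then learn least-squares combination weights on a held-out split.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic spending outcome driven by two condition indicators (hypothetical)
n = 1000
X = rng.integers(0, 2, size=(n, 2)).astype(float)
y = 5000.0 + 3000.0 * X[:, 0] + 1500.0 * X[:, 1] + rng.normal(0.0, 500.0, n)

# Split: fit base learners on the first half, learn ensemble weights on the second
train, hold = slice(0, 500), slice(500, None)

# Base learner 1: grand-mean spending predictor
z_mean = np.full(n - 500, y[train].mean())

# Base learner 2: linear regression on the condition indicators
A_train = np.column_stack([np.ones(500), X[train]])
coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
z_lin = np.column_stack([np.ones(n - 500), X[hold]]) @ coef

# Stacking step: least-squares weights over base predictions on held-out data
Z = np.column_stack([z_mean, z_lin])
w, *_ = np.linalg.lstsq(Z, y[hold], rcond=None)
ensemble = Z @ w
```

By construction, the stacked combination can do no worse on the held-out split than any single base learner in its span, which is the basic appeal of ensembling for risk adjustment.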
Abstract: Data integration and harmonization remain long-standing problems in biomedical data sharing efforts even today. The first part of this talk focuses on biomedical data integration technologies, where I will share my experiences with efforts such as the “GAAIN” project on Alzheimer’s disease research data sharing and the “BD2K” initiative for big data knowledge discovery. I will talk about our work on building systems for automated data mapping and entity linkage in this domain, which employ techniques from machine-learning classification, semantics, and record linkage. I will also touch upon our overall experience, including socio-technical challenges in realizing such data sharing networks.
The second part focuses on unstructured data understanding in health – a pipeline we developed for detailed structuring of (cancer) pathology reports for a data warehousing initiative at the UC Irvine Medical Center. A combination of open-source platforms and technologies from natural language processing, machine learning, and semantics (ontologies) was employed to build this pipeline. I will also talk about a related consumer application stemming from this effort, which mines insights from patient conversations on (condition-specific) health discussion forums on the internet and social media.
Abstract: Advances in high-throughput technologies have facilitated the collection of multiple types of omics measurements, including genomics, epigenomics, proteomics, metabolomics, and more. The ultimate goal is to integrate these disparate, yet related, omics data sets to gain new insights about biology and human diseases. However, despite significant progress toward the development of computational methods for individual omics data types, and some early work on data integration, integrative analysis methods for multiple types of omics data are still in their infancy. In this talk, I will discuss various modes of omics data integration and present statistical methods for integrative analysis of multiple types of omics measurements.
Abstract: Supervised learning techniques have been widely used in diverse scientific disciplines such as biology, genetics, and neuroscience. In this talk, I will present some new techniques for flexible learning of data with complex structure. For the first part of the talk, a new efficient regularization technique incorporating graphical structure information among predictors will be introduced. A latent group lasso penalty is applied to utilize the graph structure node-by-node. For the second part of the talk, we focus on data with multiple modalities (sources or types). In practice, it is common to have block-missing structure for such multi-modality data. A new technique effectively using all available data information without imputation will be discussed. Finally, applications for the Alzheimer's Disease Neuroimaging Initiative (ADNI) data will be used to illustrate the performance of these methods.
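The group-structured shrinkage behind such penalties can be illustrated with the proximal operator of a plain (non-overlapping) group lasso, i.e., block soft-thresholding. The talk's latent group lasso uses overlapping, graph-induced groups applied node-by-node, but the same building block is at its core; groups and values below are illustrative.

```python
import numpy as np

def group_soft_threshold(beta, groups, lam):
    """Proximal operator of lam * sum_g ||beta_g||_2 (block soft-thresholding).

    Each group of coefficients is shrunk toward zero as a block;
    groups whose norm falls below lam are zeroed out entirely.
    """
    out = np.array(beta, dtype=float)
    for g in groups:
        norm = np.linalg.norm(out[g])
        scale = max(0.0, 1.0 - lam / norm) if norm > 0 else 0.0
        out[g] = scale * out[g]
    return out

# A strong group is shrunk, a weak group is zeroed out
beta = np.array([3.0, 4.0, 0.1])
shrunk = group_soft_threshold(beta, [[0, 1], [2]], lam=1.0)
```

This is why group penalties select or discard whole blocks of predictors (e.g., a node and its graph neighborhood) rather than individual coefficients.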
Abstract: The repertoire of drugs for patients with cancer is rapidly expanding; however, cancers that appear pathologically similar often respond differently to the same drug regimens. Methods to better match patients to specific drugs are in high demand. A substantial amount of molecular profiling data from patients with cancer is now available. The most important step toward better matching is to identify molecular markers in these data that predict the response to each of hundreds of chemotherapy drugs. However, due to high dimensionality (i.e., the number of variables is much greater than the number of samples) along with potential biological or experimental confounders, it remains an open challenge to identify robust biomarkers that replicate across different studies.
In this talk, I will present two distinct machine learning techniques to resolve these challenges. These methods learn, in an unsupervised fashion, low-dimensional features that are likely to represent important molecular events in the disease process, based on molecular profiles from multiple populations of patients with a specific cancer type. I will present applications of these methods to two diseases: acute myeloid leukemia (AML) and ovarian cancer. When the first method was applied to AML data in collaboration with the UW Center for Cancer Innovation, it revealed a novel molecular marker for topoisomerase inhibitors, chemotherapy drugs widely used in AML treatment. The second method, applied to ovarian cancer data in collaboration with UW Pathology and UW Genome Sciences, led to a potential molecular driver for tumor-associated stroma. Our methods are general computational frameworks and can be applied to many other diseases.
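As a deliberately generic illustration of unsupervised low-dimensional feature learning from molecular profiles, a PCA-style projection via the SVD; the talk's methods are more specialized, so treat this purely as a sketch of the idea, with random data standing in for expression profiles.

```python
import numpy as np

def low_dim_features(X, k):
    """Project samples onto the top-k principal directions of X (n_samples x n_genes)."""
    Xc = X - X.mean(axis=0)                       # center each gene
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                          # n_samples x k feature matrix

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 500))                    # 40 samples, 500 genes (synthetic)
F = low_dim_features(X, k=3)
```

The resulting low-dimensional features can then be inspected for association with drug response or tissue composition, which is the spirit of the applications described above.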
Join us for a panel discussion with local leaders in the emerging field of Integrative Genomics. We will talk about what "Integrative Genomics" means, what the data can look like, the different types of methods for analysis of Integrative Genomics data, and some of the issues and questions that come up in these analyses. Please bring your questions and experience — we plan to have a very interactive dialog. Moderated by Ruth Etzioni.
The increased use of electronic and personal health records and personal mobile devices, coupled with clinical genome sequencing efforts, is creating many opportunities for personalized medicine. At the University of Washington, we are laying the groundwork to build the informatics and information technology infrastructure to support research on personalized approaches, and we are beginning to see the early successes of these efforts. There are many opportunities in precision and personalized medicine research, from data management to big data science and engineering to new approaches that support implementation of innovative research projects. There are also many challenges: for example, whole exome and whole genome sequencing continues to challenge researchers with a wealth of genetic variants of unknown disease effects, and the genetic causes of penetrance and phenotypic expressivity often have no known molecular basis. In this presentation, I will discuss our support of data for research use within UW Medicine, our efforts to build new machine learning and data science approaches using clinical datasets, and our efforts to develop new methods to interpret human genome sequences. Further, we are leveraging the crowd by organizing and participating in community challenges (critical assessments) to build a better understanding of the types of approaches that perform well in genome interpretation, and I will discuss our involvement in two critical assessment communities: the Critical Assessment of Genome Interpretation and the Critical Assessment of Functional Annotation.
How can we tell a good protocol or provider from an ineffective or inefficient one? How can we know that a new treatment enhances survival? How can we allocate health care dollars to ensure that those who care for the most complex patients receive the resources to do so? How can we find systems and strategies that achieve better-than-usual outcomes for vulnerable populations?
Such questions require a “population health” perspective and comprehensive modeling (risk adjustment) to tease out the effect of health care when so much of what happens to a group of patients, especially in the short run, is driven by their “baseline” characteristics.
I will discuss the tools available for extracting predictors from “administrative” data, and share ideas for: identifying and addressing hidden “missingness” and biases; using important information that cannot be used as predictors in a regulatory environment (such as prior spending levels); improving the data itself; and learning how non-medical interventions can improve health.
Julia is a high-level, high-performance dynamic programming language for technical computing. It provides a sophisticated compiler that leverages LLVM, distributed parallel execution, a focus on numerical accuracy, and an extensive mathematical function library. Julia’s Base library, largely written in Julia itself, also integrates mature, best-of-breed open source C and Fortran libraries for linear algebra, random number generation, signal processing, and string processing. Julia is already the language of choice for numerical computing in many of the best universities around the world. This talk will focus on the journey of Julia, language design, and what makes Julia fast. I will also discuss projects such as BioJulia that are pushing the envelope on bio-analytics.
For years, the promise of big data has loomed large, and many industries are now reaping the benefits – overcoming their data silos; discovering meaning in their “dark data”; collecting, analyzing and acting on data at speed; and conducting data science and advanced analytics for near-real time insights.
Medical research and precision medicine organizations are facing these challenges head-on; this session will explore how two leading research organizations are leveraging big data technologies to enhance their data science capabilities, and advance their medical research.
Gaining Sub-second Secure Access to Genomics Observations
Using Genomic Profiles to Determine Probability of Survival Based on Medication