Our first Data Science Affinity Group meeting was a great success, with standing room only. The audience included biostatisticians, software engineers, IT staff, and computational biologists. Matthew Trunnell, our new CIO, told the story of how he developed a Data Science and Data Engineering group at the Broad, which has become one of the leaders in genomic data engineering. He spoke at length about the need for engineering development in concert with research, and explained that developing a data science capability within an institution requires a different type of structure and organization. In comparison with the Broad, Fred Hutch has a great diversity of data types; he embraced the challenge and opportunity this presents and invited our participation. His email is email@example.com.
We present a project from the Johns Hopkins Individualized Health Initiative to support a personalized prostate cancer management program. For individuals with a diagnosis of low-risk prostate cancer, active surveillance offers an alternative to early curative intervention. The success of surveillance depends on effectively distinguishing indolent tumors from those with metastatic potential, a characteristic that cannot be directly observed without surgical removal of the prostate. We have developed a joint hierarchical Bayesian model that predicts an individual's latent cancer state and accommodates characteristics of data collection. Predictions can be updated in real time with an importance sampling algorithm and communicated to patients and clinicians through a decision support tool.
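As a rough illustration of how importance sampling can refresh a prediction without refitting a model, the sketch below reweights existing posterior draws by the likelihood of one new observation. It is a toy under stated assumptions, not the model described in the abstract: the latent quantity is a single real number with a standard normal prior, the new measurement has Gaussian noise, and the names `normal_pdf`, `importance_update`, `new_obs`, and `obs_sd` are hypothetical.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def importance_update(samples, new_obs, obs_sd):
    """Reweight existing draws by the likelihood of a new observation.

    Assumption (illustrative only): new_obs ~ Normal(latent value, obs_sd).
    Returns the reweighted posterior mean and the effective sample size.
    """
    weights = [normal_pdf(new_obs, s, obs_sd) for s in samples]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("all weights are zero; the draws do not cover the new data")
    norm = [w / total for w in weights]
    post_mean = sum(w * s for w, s in zip(norm, samples))
    # Effective sample size: low values signal that the draws should be refreshed.
    ess = 1.0 / sum(w * w for w in norm)
    return post_mean, ess


# Draws standing in for a previously fitted posterior (here: a Normal(0, 1) prior).
rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(20000)]

# A new measurement arrives; update the estimate by reweighting, with no refit.
post_mean, ess = importance_update(samples, new_obs=1.0, obs_sd=1.0)
print(f"updated mean: {post_mean:.3f}, effective sample size: {ess:.0f}")
```

In this conjugate toy case the exact updated mean is 0.5, so the reweighted estimate can be checked against a known answer. A real application would reuse draws from the fitted hierarchical model and would monitor the effective sample size to decide when a full refit is needed.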
Researchers are often counseled to “let the facts speak for themselves,” and mistakenly assume this means “let the data speak for themselves.” While this may hold true for a single data point, a world of high-volume, high-velocity data needs an interpreter and storyteller. We will explore the role of data visualization in communicating our findings clearly and effectively, while respecting the data and the audience. We will review the basics of graphic and table design, perception and comparisons, and which types of visualizations are best for different types of data. We will then explore different toolkits, how to select the right tool for the task at hand, and the current state of the art in data visualization.
In the past several years, there have been exciting additions to the toolkit for statisticians and data analysts who work in R. Examples include RStudio, the R Markdown format for dynamic reports, the Shiny web application framework, and improved integration with Git(Hub). The downside, of course, is the potential agony of mastering new tools and developing new workflows. Change is hard. I will give an overview of these developments and describe the costs and benefits of adopting new approaches to data exploration and analysis. I will also share my very positive and illuminating experiences from teaching these new tools in several graduate courses.