Evaluating Conversational AI for Medical Diagnosis and Management
The medical interview has been termed “the most powerful, sensitive, and versatile instrument available to the physician.” While Large Language Models (LLMs) have achieved expert-level scores on medical board examinations, these static benchmarks fail to capture the essence of clinical practice: the ability to intelligently and compassionately acquire information under conditions of uncertainty. To bridge this gap, we must evaluate AI through frameworks that mirror the complexity of human practice—most notably the Objective Structured Clinical Examination (OSCE), a validated gold standard for assessing clinical competence in medical trainees.
In this talk, I will discuss AMIE (Articulate Medical Intelligence Explorer), a research program from Google dedicated to developing and robustly evaluating AI capabilities for clinical reasoning and dialogue. To move beyond medical "question-answering," the project has used randomized, blinded crossover designs and validated patient-actor simulations to benchmark AI performance across the clinical spectrum.
I will focus on three major milestones in this research. First, I will present our Nature study "Towards Conversational Diagnostic AI", which demonstrated that in 159 text-based consultations, the system’s diagnostic accuracy and conversational skills were rated higher than those of primary care physicians (PCPs) by both specialist clinicians and patient-actors. Second, I will discuss our work "Towards Conversational AI for Disease Management", which extends these capabilities to longitudinal care. In a study of multi-visit consultations, we found the system’s sequential reasoning and adherence to clinical practice guidelines to be on par with or superior to those of PCPs. Third, I will present a prospective feasibility study at an ambulatory primary care clinic that evaluated AMIE with 100 real patients prior to their urgent care appointments. We observed high patient satisfaction and clinical reasoning quality comparable to that of PCPs, while also identifying areas where human clinicians continue to outperform AI, such as the practicality and cost-effectiveness of management plans.
I will conclude by briefly touching upon AMIE’s potential to democratize subspecialty expertise in oncology and cardio-genetics, along with other future-facing directions. By shifting the evaluative paradigm toward dynamic, human-centric interactions, this research provides a blueprint for how AI can move from a repository of facts to a meaningful participant in the clinical process.
Anil Palepu is a Research Scientist at Google Research, where he works on LLM-based systems for medical and biomedical applications. Since May 2023, he has been a core contributor to AMIE (Articulate Medical Intelligence Explorer), a project focused on evaluating the diagnostic and management reasoning of these systems in medical conversational settings. His work has primarily focused on improving these systems through methods including synthetic data generation, model post-training, agent design, and LLM-based auto-evaluation.
Prior to working at Google, Anil completed his PhD at Harvard-MIT Health Sciences & Technology, advised by Dr. Andrew Beam. During his PhD, his research focused on characterizing and improving self-supervised image-text models such as CLIP for medical applications, including local alignment, shortcut learning, and conformal prediction. He previously obtained his bachelor’s and master’s degrees in biomedical engineering at Johns Hopkins University, where he worked on various clinical applications of machine learning, including epilepsy localization in EEG, surgical instrument tracking in video, and precision embolization catheter development.
Hybrid, but in-person attendance is encouraged. Please contact biostat@fredhutch.org for a virtual link.