About
Co-advised by Raj Manrai and Chirag Patel, I focus on answering a deceptively simple question: what does it mean to be "normal" in medicine — and how do those definitions shape clinical decisions? Using methods ranging from statistical learning to generative modeling, I unpack assumptions historically embedded in the definition of normal and develop more precise, data-driven definitions that better reflect the diversity and heterogeneity of patients in the era of precision medicine.
I work across the full stack of clinical ML: from wrangling billions of longitudinal lab measurements across international health systems, to training transformer architectures for sequence prediction, to evaluating foundation models (GPT-4V, Gemini) for fairness in medical imaging. My work spans structured EHR data, medical images, and clinical text — always with the goal of building systems that are both technically rigorous and clinically deployable.
Research
In the nineteenth century, Adolphe Quetelet applied the Gaussian curve to human traits and introduced l'homme moyen — the "average man." What began as a statistical abstraction became a clinical standard: the average defined the expected, and deviation from it signaled disease. Two centuries later, this logic still governs medicine. Reference ranges, diagnostic thresholds, and risk scores are derived from population aggregates. Patients are evaluated by proximity to a mean. But what is typical for a population is an imperfect guide to what is normal for a given patient. My research revisits this tension — asking when population norms fail, what happens when we try to personalize them, and what biases emerge when we hand the task to AI.
Redefining "Normal" in Laboratory Medicine
You get bloodwork done every year. Each time, every result comes back "normal." But your values have been slowly creeping upward — a trajectory that, for you, signals something is wrong. The population-wide reference interval is too broad to flag the drift. By the time a value finally crosses the threshold, the disease has been developing for months or years.
The obvious fix — comparing you only to yourself — overcorrects in the other direction, flagging healthy fluctuations as abnormal. Neither population averages nor purely individual baselines solve the problem alone.
We built NORMA, an autoregressive transformer trained on billions of longitudinal lab measurements that generates reference intervals conditioned on both a patient's own history and population-level expectations for health. It catches disease signals months earlier than population intervals — without the false-positive burden of purely personalized approaches.
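NORMA itself is an autoregressive transformer and is not reproduced here, but the tension between population and personal baselines can be sketched with a much simpler stand-in: an empirical-Bayes style blend that shrinks a patient's observed mean toward the population mean when little history is available. The `personalized_interval` helper and all numbers below are illustrative assumptions, not the model.

```python
from statistics import mean

def personalized_interval(history, pop_mean, pop_sd, within_sd, z=1.96):
    """Blend a patient's own history with the population distribution.

    The weight on the personal mean grows with the amount of history
    (empirical-Bayes style shrinkage). With no history, this reduces to
    the ordinary population reference interval. The interval width here
    is simplified to the within-person variability alone.
    """
    n = len(history)
    if n == 0:
        center, sd = pop_mean, pop_sd
    else:
        # Ratio of within-person to between-person variance controls shrinkage.
        w = n / (n + (within_sd / pop_sd) ** 2)
        center = w * mean(history) + (1 - w) * pop_mean
        sd = within_sd
    return center - z * sd, center + z * sd

# A stable patient whose values sit well below the population mean:
lo, hi = personalized_interval([70, 72, 71, 73], pop_mean=90, pop_sd=15, within_sd=4)
```

For this patient the interval centers near 72 rather than 90, so a jump to 85 would be flagged even though it still falls inside the population-wide range — exactly the drift a one-size-fits-all interval misses.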
Toward Individual-Level Features in Clinical Reference Equations
You take a breathing test at your doctor's office. The machine measures how much air you can exhale. But before your doctor interprets the result, the system adjusts your expected lung function based on your race — using a different equation depending on which box you check. This practice has no clear biological basis and dates to the 1840s.
Medical societies now recommend removing race, but simply dropping it from the equation — by averaging or refitting — doesn't address what race was standing in for. The question isn't whether to remove race. It's what to replace it with.
We developed ARC, a framework that identifies the individual-level anatomical features — sitting height, waist circumference — that race was crudely proxying for. The resulting equations are built on your body, not your demographic group, and they are both more accurate and more equitable — generalizing to populations where existing models fail.
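ARC's feature-selection machinery is not shown here; as a toy illustration of the underlying idea, the sketch below fits two reference equations on synthetic data — one using standing height alone, one using a directly measured anatomical feature (sitting height) — and compares their residual error. The cohort, coefficients, and `fit_ols` helper are invented for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic cohort: the "true" driver of lung capacity in this toy world is
# sitting height (trunk length), which varies relative to standing height —
# the kind of individual-level variation race was crudely standing in for.
standing_height = rng.normal(170, 8, n)                          # cm
sitting_height = standing_height * rng.normal(0.52, 0.02, n)     # cm
fev1 = 0.05 * sitting_height - 1.0 + rng.normal(0, 0.15, n)      # litres (toy)

def fit_ols(X, y):
    """Ordinary least squares with an intercept; returns coefficients and RMSE."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return beta, float(np.sqrt(np.mean(resid ** 2)))

# Reference equation built on standing height alone vs. one built on the
# individual-level anatomical measurement directly.
_, rmse_standing = fit_ols(standing_height[:, None], fev1)
_, rmse_sitting = fit_ols(sitting_height[:, None], fev1)
```

In this toy setup the equation built on the measured anatomical feature has lower residual error, because it captures the variation directly rather than through a proxy — the same logic, at much larger scale, behind replacing race with measurements of the individual body.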
Bias and Shortcut Learning in Medical AI
You upload a photo of a suspicious mole to an AI health tool. The model gives you a confident answer. But how reliable is that answer — and would it be different if you had a different skin tone, or were older, or if the image were taken with a different camera? Foundation models like GPT-4V and Gemini Pro are increasingly used for medical image interpretation, but their accuracy varies systematically across patient demographics in ways that are invisible to the user.
These models ship with safety guardrails intended to prevent clinical diagnoses — but we show these guardrails are trivially bypassed through simple prompt rephrasing.
We systematically evaluate how vision-language models behave across patient subgroups, prompt strategies, and imaging domains — and build interpretability frameworks (TRACE) that dissect what these models actually learn from medical images. The finding: models exploit acquisition protocols, pixel intensity patterns, and diagnostic labels as shortcuts, rather than learning meaningful anatomy. We identify what's needed before these models can be safely deployed: subgroup-level auditing and prompt-robustness testing as standard practice.
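As a minimal sketch of what subgroup-level auditing looks like in practice — not the evaluation harness used in this work — the snippet below computes per-group accuracy and the worst-case gap from hypothetical audit records. The group labels and counts are invented for illustration.

```python
from collections import defaultdict

def subgroup_accuracy(records):
    """Accuracy per subgroup from (group, correct) records, plus the worst gap.

    Rather than reporting one aggregate score, a subgroup audit reports
    accuracy for each demographic group and the spread between the best-
    and worst-served groups.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for group, correct in records:
        totals[group] += 1
        hits[group] += int(correct)
    acc = {g: hits[g] / totals[g] for g in totals}
    gap = max(acc.values()) - min(acc.values())
    return acc, gap

# Hypothetical audit records: (Fitzpatrick skin-tone group, answered correctly?)
records = ([("I-II", True)] * 90 + [("I-II", False)] * 10
           + [("V-VI", True)] * 70 + [("V-VI", False)] * 30)
acc, gap = subgroup_accuracy(records)  # 0.90 vs 0.70 accuracy, gap of 0.20
```

An aggregate accuracy of 80% would hide this entirely; the per-group view is what makes the 20-point disparity — invisible to the user in the scenario above — visible to an auditor.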
