About
Co-advised by Raj Manrai and Chirag Patel, I focus on answering a deceptively simple question: what does it mean to be "normal" in medicine — and how do those definitions shape clinical decisions? Using methods ranging from statistical learning to generative modeling, I unpack assumptions historically embedded in the definition of normal and develop more precise, data-driven definitions that better reflect the diversity and heterogeneity of patients in the era of precision medicine.
I work across the full stack of clinical ML: from wrangling billions of longitudinal lab measurements across international health systems, to training transformer architectures for sequence prediction, to evaluating foundation models (GPT-4V, Gemini) for fairness in medical imaging. My work spans structured EHR data, medical images, and clinical text — always with the goal of building systems that are both technically rigorous and clinically deployable.
Research
In the nineteenth century, Adolphe Quetelet applied the Gaussian curve to human traits and introduced l'homme moyen — the "average man." What began as a statistical abstraction became a clinical standard: the average defined the expected, and deviation from it signaled disease. Two centuries later, this logic still governs medicine. Reference ranges, diagnostic thresholds, and risk scores are derived from population aggregates. Patients are evaluated by proximity to a mean. But what is typical for a population is an imperfect guide to what is normal for a given patient. My research revisits this tension — asking when population norms fail, what happens when we try to personalize them, and what biases emerge when we hand the task to AI.
Redefining "Normal" in Laboratory Medicine
You get bloodwork done every year. Each time, every result comes back "normal." But your values have been slowly creeping upward — a trajectory that, for you, signals something is wrong. The population-wide reference interval is too broad to flag the drift. By the time a value finally crosses the threshold, the disease has been developing for months or years.
The obvious fix — comparing you only to yourself — overcorrects in the other direction, flagging healthy fluctuations as abnormal. Neither population averages nor purely individual baselines solve the problem alone.
We built NORMA, an autoregressive transformer trained on billions of longitudinal lab measurements that generates reference intervals conditioned on both a patient's own history and population-level expectations for health. It catches disease signals months earlier than population intervals — without the false-positive burden of purely personalized approaches.
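NORMA itself is an autoregressive transformer and is not reproduced here, but the tension between population and personal baselines can be sketched with a much simpler stand-in: an empirical-Bayes style blend that shrinks a patient's observed mean toward the population mean when little history is available. The `personalized_interval` helper and all numbers below are illustrative assumptions, not the model.

```python
from statistics import mean

def personalized_interval(history, pop_mean, pop_sd, within_sd, z=1.96):
    """Blend a patient's own history with the population distribution.

    The weight on the personal mean grows with the amount of history
    (empirical-Bayes style shrinkage). With no history, this reduces to
    the ordinary population reference interval. The interval width here
    is simplified to the within-person variability alone.
    """
    n = len(history)
    if n == 0:
        center, sd = pop_mean, pop_sd
    else:
        # Ratio of within-person to between-person variance controls shrinkage.
        w = n / (n + (within_sd / pop_sd) ** 2)
        center = w * mean(history) + (1 - w) * pop_mean
        sd = within_sd
    return center - z * sd, center + z * sd

# A stable patient whose values sit well below the population mean:
lo, hi = personalized_interval([70, 72, 71, 73], pop_mean=90, pop_sd=15, within_sd=4)
```

For this patient the interval centers near 72 rather than 90, so a jump to 85 would be flagged even though it still falls inside the population-wide range — exactly the drift a one-size-fits-all interval misses.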
Toward Individual-Level Features in Clinical Reference Equations
You take a breathing test at your doctor's office. The machine measures how much air you can exhale. But before your doctor interprets the result, the system adjusts your expected lung function based on your race — using a different equation depending on which box you check. This practice has no clear biological basis and dates to the 1840s.
Medical societies now recommend removing race, but simply dropping it from the equation — by averaging or refitting — doesn't address what race was standing in for. The question isn't whether to remove race. It's what to replace it with.
We developed ARC, a framework that identifies the individual-level anatomical features — sitting height, waist circumference — that race was crudely proxying for. The resulting equations are built on your body, not your demographic group, and they are both more accurate and more equitable — generalizing to populations where existing models fail.
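ARC's feature-selection machinery is not shown here; as a toy illustration of the underlying idea, the sketch below fits two reference equations on synthetic data — one using standing height alone, one using a directly measured anatomical feature (sitting height) — and compares their residual error. The cohort, coefficients, and `fit_ols` helper are invented for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic cohort: the "true" driver of lung capacity in this toy world is
# sitting height (trunk length), which varies relative to standing height —
# the kind of individual-level variation race was crudely standing in for.
standing_height = rng.normal(170, 8, n)                          # cm
sitting_height = standing_height * rng.normal(0.52, 0.02, n)     # cm
fev1 = 0.05 * sitting_height - 1.0 + rng.normal(0, 0.15, n)      # litres (toy)

def fit_ols(X, y):
    """Ordinary least squares with an intercept; returns coefficients and RMSE."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return beta, float(np.sqrt(np.mean(resid ** 2)))

# Reference equation built on standing height alone vs. one built on the
# individual-level anatomical measurement directly.
_, rmse_standing = fit_ols(standing_height[:, None], fev1)
_, rmse_sitting = fit_ols(sitting_height[:, None], fev1)
```

In this toy setup the equation built on the measured anatomical feature has lower residual error, because it captures the variation directly rather than through a proxy — the same logic, at much larger scale, behind replacing race with measurements of the individual body.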
Bias and Shortcut Learning in Medical AI
You upload a photo of a suspicious mole to an AI health tool. The model gives you a confident answer. But how reliable is that answer — and would it be different if you had a different skin tone, or were older, or if the image were taken with a different camera? Foundation models like GPT-4V and Gemini Pro are increasingly used for medical image interpretation, but their accuracy varies systematically across patient demographics in ways that are invisible to the user.
These models ship with safety guardrails intended to prevent clinical diagnoses — but we show these guardrails are trivially bypassed through simple prompt rephrasing.
We systematically evaluate how vision-language models behave across patient subgroups, prompt strategies, and imaging domains — and build interpretability frameworks (TRACE) that dissect what these models actually learn from medical images. The finding: models exploit acquisition protocols, pixel intensity patterns, and diagnostic labels as shortcuts, rather than learning meaningful anatomy. We identify what's needed before these models can be safely deployed: subgroup-level auditing and prompt-robustness testing as standard practice.
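As a minimal sketch of what subgroup-level auditing looks like in practice — not the evaluation harness used in this work — the snippet below computes per-group accuracy and the worst-case gap from hypothetical audit records. The group labels and counts are invented for illustration.

```python
from collections import defaultdict

def subgroup_accuracy(records):
    """Accuracy per subgroup from (group, correct) records, plus the worst gap.

    Rather than reporting one aggregate score, a subgroup audit reports
    accuracy for each demographic group and the spread between the best-
    and worst-served groups.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for group, correct in records:
        totals[group] += 1
        hits[group] += int(correct)
    acc = {g: hits[g] / totals[g] for g in totals}
    gap = max(acc.values()) - min(acc.values())
    return acc, gap

# Hypothetical audit records: (Fitzpatrick skin-tone group, answered correctly?)
records = ([("I-II", True)] * 90 + [("I-II", False)] * 10
           + [("V-VI", True)] * 70 + [("V-VI", False)] * 30)
acc, gap = subgroup_accuracy(records)  # 0.90 vs 0.70 accuracy, gap of 0.20
```

An aggregate accuracy of 80% would hide this entirely; the per-group view is what makes the 20-point disparity — invisible to the user in the scenario above — visible to an auditor.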
