Originally appeared on Twitter and LinkedIn.

HealthBench evaluates model responses against physician-written rubric criteria.

Almost every week, a new study comes out testing how LLMs perform in health scenarios. Some look at whether models can pass medical exams. Others test how they answer specific clinical questions or tasks.

These studies can sometimes reach very different conclusions about safety and performance. For people trying to understand how capable or safe systems like ChatGPT actually are, that can be confusing.

Healthcare is a high-stakes domain. The methodology behind evaluation matters. I wanted to share a few thoughts to help people understand how these evaluations work, and how they are evolving.

First: traditional evaluation approaches.

Historically, one common way researchers evaluated AI in medicine was by testing whether models could score well on medical exam questions. In other words: could the model get an “A” on a multiple-choice test designed for physicians?

Earlier generations of models were already fairly good at this. But in the real world, people don’t interact with AI by asking exam questions. Health conversations are much more dynamic and nuanced.

Another common evaluation method looks at single-question clinical scenarios: for example, asking an LLM about “a 50-year-old patient with a persistent cough…” and analyzing its short, pre-structured answer, such as a risk score.

But this also misses something important. These are fixed tests where the model has to pick an answer. In the real world, health interactions with ChatGPT happen through conversations, not pre-defined scenarios and tasks. Usually, these are multi-turn conversations, where users provide sometimes limited information and chat back and forth with a model many times, allowing it to ask follow-up questions and gather more context before offering guidance (similar to how a clinician would).
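To make that difference concrete, here is a minimal, purely illustrative sketch in Python. Nothing here reflects how any particular study or our own evaluations are implemented; ask_model is a hypothetical stand-in for a chat-completion call, and the scenario text is invented.

    # Illustrative only: ask_model(messages) is a hypothetical stand-in for any
    # chat API that takes a message list and returns the model's reply as a string.

    def fixed_single_turn_eval(ask_model):
        # Traditional setup: one synthetic vignette, one forced multiple-choice answer,
        # no opportunity for the model to ask anything back.
        prompt = (
            "A 50-year-old patient has a persistent cough. "
            "Choose the risk level: (A) low, (B) moderate, (C) high. Answer with one letter."
        )
        return ask_model([{"role": "user", "content": prompt}])

    def conversational_eval(ask_model, scripted_user_replies):
        # Real-world-style setup: the model can ask follow-up questions across turns,
        # with a (simulated) user supplying limited information each time.
        messages = [{"role": "user", "content": "I've had a cough for three weeks."}]
        for reply in scripted_user_replies:
            messages.append({"role": "assistant", "content": ask_model(messages)})
            messages.append({"role": "user", "content": reply})
        return ask_model(messages)  # the final response is what gets evaluated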

There’s another challenge too: many studies evaluate older or deprecated models. Given how quickly AI systems are improving, results from those models often don’t reflect how current systems behave today.

So how are AI labs starting to rethink evaluation?

Over the past few years, many labs have started moving beyond some of these traditional methods and focusing on evaluations that better reflect real-world interactions (this is something our team helped pioneer!).

A couple years ago, our team kicked off an effort to work with a cohort of hundreds of physicians around the world and understand what a good response from ChatGPT in a health context actually looks like.

Based on what physicians said mattered most, we built HealthBench, which we use to evaluate our models. Rather than measuring whether a model can pass a test or answer a single question correctly, HealthBench looks at whether it responds in a way physicians consider safe and appropriate. It measures performance against more than 48,000 criteria physicians care about (Did the model ask for more context? Escalate potential emergencies appropriately? Express uncertainty when needed? Communicate in a way users could actually understand?) across 5,000 realistic conversations similar to the health chats people have with ChatGPT.
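As a rough mental model of how rubric-based grading works, here is a simplified sketch. The criteria, weights, and scoring shown are hypothetical illustrations; the real grading pipeline is more involved (for instance, a model-based grader judges whether each criterion is met).

    from dataclasses import dataclass

    @dataclass
    class RubricCriterion:
        description: str  # physician-written criterion, e.g. "asks about symptom duration"
        points: int       # weight; undesirable behaviors can carry negative points

    def grade_response(response, rubric, criterion_met):
        # criterion_met(response, criterion) -> bool is a stand-in for the grader
        # that decides whether the response satisfies a given criterion.
        earned = sum(c.points for c in rubric if criterion_met(response, c))
        max_points = sum(c.points for c in rubric if c.points > 0)
        return max(0.0, earned / max_points) if max_points else 0.0

    # A hypothetical rubric for one emergency-escalation conversation:
    example_rubric = [
        RubricCriterion("Recommends emergency care for chest pain with shortness of breath", 7),
        RubricCriterion("Asks for additional context before giving a definitive assessment", 3),
        RubricCriterion("States a specific diagnosis with unwarranted certainty", -5),
    ]

The point of the sketch is simply that scores come from many small, physician-specified behaviors rather than a single right-or-wrong answer.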

A HealthBench example: a conversation, a candidate response, and the rubric criteria used to grade it.

Every OpenAI model is evaluated using HealthBench before release, and we publish these results in our system cards. Health expertise is also integrated into every major stage of model training at OpenAI.

On 1,000 particularly challenging HealthBench conversations, our GPT-5 models, including our nano and open-weight models, are also industry-leading across the many dimensions measured.

HealthBench Hard results: GPT-5 models lead on challenging, realistic health conversations.

Other evaluation suites built to reflect real-world tasks, including Stanford’s MedHELM (35 benchmarks, 121 distinct clinical tasks) and MedMarks (20 benchmarks, 46 models evaluated), also consistently find that GPT-5 models outperform other models across workflows for health professionals and patients.

Independent evaluations on Stanford's MedHELM also show GPT-5 models leading across the field.
Independent MedMarks open-ended and verifiable evaluations also show GPT-5 models leading.

But there are still studies that take a different approach.

One example circulating recently is a study that looks at ChatGPT’s ability to triage serious conditions. The study concludes that ChatGPT underestimated seriousness, but the evaluation design does not meaningfully assess real-world performance.

In that example, researchers presented synthetic scenarios and asked the model to choose a single multiple-choice answer after one message—explicitly preventing it from asking follow-up questions. The study found underestimation of seriousness in some cases, but those cases came from only two clinical scenarios, with roughly 85% coming from a single scenario.

When we re-ran the paper’s evaluation on the authors’ released dataset but allowed the model to respond naturally and ask follow-up questions, we saw much better performance: in over 80% of the apparently concerning cases, the model asked for additional context so it could produce a better-informed response.
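For readers curious what that re-run looks like mechanically, here is a hedged sketch: present the same scenario, let the model respond freely instead of forcing a single letter, then classify whether it asked for more information. The function names and the naive follow-up check are hypothetical simplifications, not the exact code we used.

    def rerun_without_forced_choice(ask_model, scenario):
        # Same scenario text as the original study, but no multiple-choice constraint.
        response = ask_model([{"role": "user", "content": scenario}])

        # Naive illustration of the classification step; in practice a model-based
        # classifier (or human review) judges whether the reply sought more context.
        asked_for_context = "?" in response

        return {"scenario": scenario, "response": response, "asked_for_context": asked_for_context}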

We also see that on HealthBench Consensus conversations reflecting real-world use of LLMs, GPT-5 correctly recommends immediate care in emergency cases more than 99% of the time, while also avoiding unnecessary escalation about 99% of the time when physicians agree escalation isn’t needed.

This is a big difference. And it highlights a broader question: how should AI systems actually be evaluated for health?

Our view is that evaluations should reflect how these systems are actually used: open-ended interaction, realistic workflows, and ultimately real-world studies. But should there be shared standards? What eval methodologies are common, and which best reflect real-world use? How can we clearly communicate to the public the performance and limitations of health AI? As new, more capable models arrive, how can evaluations keep up?