Researchers have introduced EHRBench, an automated benchmark of nearly 1 million question-answer items derived from real electronic health records. It is designed to evaluate how reliably large language models can support clinical decision-making tasks such as diagnosis, treatment selection, and prognosis. The benchmark was built with an EHR-LLM-knowledge base pipeline that automatically converts patient encounter data into structured templates while filtering out hallucinations. Tests of more than 30 LLMs reveal consistent capability gaps, showing how much work remains before these systems are clinically safe.
Why it matters: As healthcare organizations rapidly adopt LLMs for clinical support, a benchmark built from real-world records gives the AI and healthcare industries a standardized way to measure, and ultimately improve, the reliability of language models on the decision-making tasks that matter most in practice.