Large language models (LLMs) are increasingly recognized for their potential in clinical consultation tasks. However, most medical evaluations remain static, single-turn, or narrowly outcome-based, which limits their ability to reflect the sequential, uncertain, and interactive nature of real-world care. To address this, we propose AIPatient Arena, an EHR-grounded evaluation framework for assessing the clinical utility of LLMs across eight dimensions of clinical competence.
The framework converts EHR data into patient-specific knowledge graphs, enabling multi-turn physician-patient interactions. We applied AIPatient Arena to a primary cohort of 437 patients and two out-of-distribution validation cohorts of 119 and 67 patients. Our findings show that LLMs performed well in medical interview questioning skills (QS; mean scores, 4.43-4.99/5), ethical and professional conduct (ET; 4.38-4.93/5), and clarity and transparency of clinical explanations (EX; 3.80-4.72/5).
Performance was moderate in information integration (II; 3.19-4.21/5) and medication safety and justification (MS; 3.13-3.78/5), but persistent weaknesses were observed in handling ambiguous patient responses (HR; 2.57-3.32/5), information coverage (IC; 2.08-3.02/5), and diagnostic accuracy and reasoning (Dx; 2.63-3.55/5).
Process-based evaluation revealed recurrent interaction failures, including repetitive questioning, omission of past medical history, and inadequate handling of uncertainty. Richer conversational context improved diagnostic reasoning but yielded limited gains in treatment planning. These findings indicate that final-answer accuracy alone is insufficient for evaluating clinical readiness and highlight the importance of assessing how models gather, interpret, and communicate information throughout a consultation. AIPatient Arena provides an EHR-grounded framework for workflow-oriented pre-deployment evaluation of medical LLMs.
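To make the eight-dimension picture easier to scan, the reported score ranges can be tabulated and ordered. The following is only a rough sketch: the ranges are copied from the summary above, but the midpoint of each range is our own illustrative statistic, not one reported by the authors, and the names `SCORE_RANGES`, `midpoint`, and `rank_by_midpoint` are hypothetical.

```python
# Reported ranges of mean scores (out of 5) per dimension, copied from the summary.
SCORE_RANGES = {
    "QS": (4.43, 4.99),  # questioning skills
    "ET": (4.38, 4.93),  # ethical and professional conduct
    "EX": (3.80, 4.72),  # clarity and transparency of explanations
    "II": (3.19, 4.21),  # information integration
    "MS": (3.13, 3.78),  # medication safety and justification
    "HR": (2.57, 3.32),  # handling ambiguous patient responses
    "IC": (2.08, 3.02),  # information coverage
    "Dx": (2.63, 3.55),  # diagnostic accuracy and reasoning
}


def midpoint(bounds):
    """Midpoint of a (low, high) range; an assumption for illustration only."""
    low, high = bounds
    return (low + high) / 2


def rank_by_midpoint(ranges):
    """Return dimension codes ordered from highest to lowest range midpoint."""
    return sorted(ranges, key=lambda dim: midpoint(ranges[dim]), reverse=True)


if __name__ == "__main__":
    for dim in rank_by_midpoint(SCORE_RANGES):
        low, high = SCORE_RANGES[dim]
        print(f"{dim}: {low:.2f}-{high:.2f} (midpoint {midpoint((low, high)):.2f})")
```

By this rough measure, questioning skills sits at the top and information coverage at the bottom. Because the ranges overlap, the ordering of neighboring dimensions should not be over-interpreted.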
Blogger's Review: AIPatient Arena offers a new perspective on evaluating LLMs in clinical settings, emphasizing multi-turn interaction and the way models gather and process information rather than final answers alone. Future research should focus on improving model performance in real-world use, particularly in handling the complexity and uncertainty inherent in healthcare.