Autonomous driving has shifted toward end-to-end policy learning. Evaluating these policies reliably and interpretably remains a fundamental challenge because driving quality is highly context-dependent. Commonly used rule-based driving metrics such as EPDMS are interpretable but lack context awareness, while recent VLM-based evaluations are context-aware but limited by ambiguous VLM outputs and weak physical grounding. To address this, we introduce DriveJudge.
DriveJudge is a driving evaluation agent that combines rule-grounded evaluation with Vision-Language Model (VLM) reasoning: it first interprets the environmental context, then selectively invokes physically grounded, deterministic rule functions. To train and evaluate DriveJudge, we curate a large-scale dataset of 33,577 challenging driving samples, each with human annotations indicating whether the driving behavior is reasonable in the given scenario. This dataset allows us to tackle the underexplored problem of driving metric evaluation and introduces two human-aligned benchmark tasks: Driving Quality Classification and Trajectory Preference Selection.
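To make the "interpret the context, then selectively invoke rules" idea concrete, here is a minimal sketch. It is not DriveJudge's implementation: the `Trajectory` fields, the three rule functions, their thresholds, and the `blocked_lane` tag are all hypothetical, and a simple tag lookup stands in for the VLM's context interpretation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass
class Trajectory:
    """Hypothetical summary of a planned ego trajectory."""
    max_speed_mps: float      # highest ego speed along the trajectory
    min_gap_m: float          # closest distance to any other road user
    max_lane_offset_m: float  # largest lateral deviation from lane center


# Deterministic rule functions: each returns True when the rule is satisfied.
# All thresholds are illustrative assumptions, not values from the paper.
def within_speed_limit(t: Trajectory) -> bool:
    return t.max_speed_mps <= 13.9  # assumed limit of roughly 50 km/h


def keeps_safe_gap(t: Trajectory) -> bool:
    return t.min_gap_m >= 2.0


def stays_in_lane(t: Trajectory) -> bool:
    return t.max_lane_offset_m <= 0.5


RULES: Dict[str, Callable[[Trajectory], bool]] = {
    "within_speed_limit": within_speed_limit,
    "keeps_safe_gap": keeps_safe_gap,
    "stays_in_lane": stays_in_lane,
}


def select_rules(context_tags: Set[str]) -> List[str]:
    """Stand-in for the VLM step: choose which rules apply in this context."""
    selected = ["within_speed_limit", "keeps_safe_gap"]
    # Leaving the lane is acceptable when the lane ahead is blocked,
    # so the lane-keeping rule is skipped in that context.
    if "blocked_lane" not in context_tags:
        selected.append("stays_in_lane")
    return selected


def judge(t: Trajectory, context_tags: Set[str]) -> Dict[str, object]:
    """Run only the selected rules and report a verdict plus per-rule results."""
    results = {name: RULES[name](t) for name in select_rules(context_tags)}
    return {"reasonable": all(results.values()), "rule_results": results}


# A swerve around an obstacle: large lane offset, otherwise safe.
swerve = Trajectory(max_speed_mps=10.0, min_gap_m=3.5, max_lane_offset_m=1.2)
print(judge(swerve, {"blocked_lane"}))  # lane rule skipped
print(judge(swerve, set()))             # lane rule applied
```

The same trajectory is judged reasonable when the lane is blocked and unreasonable otherwise, which is the kind of context dependence a fixed rule set cannot express. The per-rule results keep the verdict traceable to specific checks.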
DriveJudge outperforms EPDMS on driving quality classification by 21.23 AUC points and the recent VLM-based DriveCritic on trajectory preference selection by 6.5%, setting a new standard for interpretable and precise driving evaluation.
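For readers unfamiliar with how the two tasks might be scored, the sketch below computes a ROC AUC for classification and an agreement rate for preference selection. It assumes the preference result is reported as accuracy against human choices, which the summary above does not state explicitly. The function names and toy data are illustrative and not taken from the paper.

```python
from typing import List, Sequence, Tuple


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the probability that a positive sample outscores a negative one.

    labels: 1 if humans judged the driving reasonable, else 0.
    scores: the metric's score for each sample (higher means more reasonable).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    # Count pairwise wins; ties count as half a win.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def preference_accuracy(
    score_pairs: Sequence[Tuple[float, float]],
    human_choice: Sequence[int],
) -> float:
    """Fraction of pairs where the higher-scored trajectory matches the human pick.

    score_pairs: (score_a, score_b) for each trajectory pair.
    human_choice: 0 if annotators preferred A, 1 if they preferred B.
    """
    if not human_choice:
        raise ValueError("need at least one pair")
    predicted: List[int] = [0 if a >= b else 1 for a, b in score_pairs]
    correct = sum(p == h for p, h in zip(predicted, human_choice))
    return correct / len(human_choice)


# Toy data, invented for illustration only.
print(roc_auc([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.6]))
print(preference_accuracy([(0.8, 0.3), (0.1, 0.5), (0.6, 0.7)], [0, 1, 0]))
```

On a 0 to 100 scale, a difference of 21.23 AUC points corresponds to a difference of 0.2123 in the value returned by `roc_auc`.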
Blogger's Review: The introduction of DriveJudge marks a significant advancement in autonomous driving evaluation methods. By integrating rule-based and semantic understanding approaches, it enhances both interpretability and accuracy, showing great potential for future applications in the autonomous driving field.