Abstract
Continuous evaluation of Large Language Model (LLM) products relies on a strong LLM judge treated as ground truth: a cheap monitor scores every interaction, and a team is paged when the score drifts down. However, the judge is itself a model behind an API, and a silent version bump or scoring-prompt update changes how it scores, making every drift alarm ambiguous between a worse product and a changed judge.
We resolve the ambiguity with a fixed, human-labeled anchor set that the current judge re-scores at a steady interleave, a second betting e-process on the judge-versus-human gap, and a guard-window rule returning a verdict in {none, system, judge}. We prove anytime-validity, one-way identification (only the judge can move the anchors), an attribution race whose design law is that the anchors must out-run the main process they guard, and process orthogonality.
On two real judge changes, a silent version bump is detected as judge drift in 60 out of 60 runs with zero judge-to-system misattribution, and a contaminating strict-prompt change is correctly attributed in 110 out of 120 runs at guard width 300, while the industry-default rolling z-test falsely alarms on 75% of drift-free streams. Each experiment replicates on a second domain (TL;DR summarization) with no retuning, and where the domains differ, the differences align with the race's predictions: the strict-prompt change shifts scores harder there, so the anchors fire faster, and attribution becomes perfect (240 out of 240). The monitor runs at approximately 0.64 of the cost of strong-judging every item, or 0.21 in a cheaper-but-deafer regime.
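As a rough illustration of the mechanism the abstract describes, the sketch below pairs a generic one-sided betting e-process with a toy guard-window rule. It is not the paper's implementation. The class `BettingEProcess`, the function `verdict`, the fixed bet size `lam`, the null mean `m0`, and the exact form of the guard rule are all assumptions made for illustration; the abstract specifies only that there are two e-processes, a guard window, and a verdict in {none, system, judge}.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BettingEProcess:
    """One-sided betting e-process for H0: E[x] <= m0, with x in [0, 1].

    Hypothetical sketch: the wealth is multiplied by 1 + lam * (x - m0) at
    each step. With 0 <= lam < 1/m0 every factor is positive, and under H0
    the wealth is a nonnegative supermartingale, so by Ville's inequality it
    reaches 1/alpha with probability at most alpha at any stopping time.
    """
    m0: float                      # assumed null mean (e.g. baseline gap)
    lam: float                     # assumed fixed bet size, 0 <= lam < 1/m0
    alpha: float = 0.05
    wealth: float = 1.0
    fired_at: Optional[int] = None

    def update(self, x: float, t: int) -> None:
        self.wealth *= 1.0 + self.lam * (x - self.m0)
        if self.fired_at is None and self.wealth >= 1.0 / self.alpha:
            self.fired_at = t


def verdict(main_fired_at: Optional[int],
            anchor_fired_at: Optional[int],
            now: int,
            guard: int) -> str:
    """Toy guard-window rule (an assumption, not the paper's exact rule).

    - If the anchor process has fired, blame the judge: only a judge change
      can move scores on the fixed human-labeled anchors.
    - If the main process fired and the anchors stayed quiet for `guard`
      further steps, blame the system.
    - Otherwise withhold a verdict.
    """
    if anchor_fired_at is not None:
        return "judge"
    if main_fired_at is not None and now - main_fired_at >= guard:
        return "system"
    return "none"


# Usage: an anchor stream whose judge-versus-human gap jumps from the
# assumed baseline of 0.1 to 0.6 doubles the wealth at every step.
anchor = BettingEProcess(m0=0.1, lam=2.0)
for t in range(1, 11):
    anchor.update(0.6, t)
print(anchor.fired_at, verdict(None, anchor.fired_at, now=10, guard=300))
```

In this toy rule the guard width trades attribution accuracy against delay: a wider window gives the anchors more time to fire before a main-process alarm is attributed to the system, which is one way to read the design law that the anchors must out-run the process they guard.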
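Blogger's Review: This paper tackles the attribution problem in LLM evaluation, namely whether a drift alarm reflects a worse product or a changed judge, by pairing a fixed human-labeled anchor set with a second e-process and a guard-window verdict rule. The reported results are strong: the industry-default rolling z-test falsely alarms on 75% of drift-free streams, while the proposed monitor attributes both tested judge changes correctly in most or all runs, and the findings replicate on a second domain without retuning. That combination gives the work clear practical value as well as research significance.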