Abstract
LLM-as-a-Judge is widely used to rank model outputs, train reward models, and populate public leaderboards, yet its run-to-run reliability remains under-characterized. We study repeated identical evaluations on 29 tasks across 10 categories using two OpenAI judge models (GPT-4o-mini and GPT-4.1-mini), conducting 50 pairwise trials and 50 pointwise trials per question, supplemented by temperature and prompt-sensitivity ablations.
Across judges, pairwise preferences flip on average 13.6% of the time, with 28% of questions exceeding a 20% flip rate and one question reaching 56%. GPT-4o-mini exhibits a significant first-position bias (72% A-majority, p = 0.024). Meanwhile, mean pointwise score gaps are small (0.19--0.36 on a 10-point scale) and not statistically significant in aggregate, leading to a pairwise-pointwise gap: judges often choose a winner even when their own scalar scores provide little evidence of a meaningful quality difference.
Beyond within-judge instability, cross-judge agreement is only 76% ($\kappa = 0.51$), semantically equivalent prompt templates change majority outcomes in 25% of tested cases, and deterministic decoding reduces but does not eliminate inconsistency. A reliability curve analysis shows that, in our dataset, 11 repeated trials are needed for a majority vote to recover the 50-trial reference verdict with 95% probability on average, rising to 15 for high-variance questions. These findings suggest that single-trial LLM judging is often too noisy for high-stakes evaluation, and that multi-trial aggregation, position randomization, and explicit uncertainty reporting should be standard practice. Since both judges are from a single provider, cross-provider replication remains an important next step.
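To make the flip-rate and majority-vote ideas concrete, here is a minimal Python sketch. It is not the authors' method: the abstract does not define the flip rate or describe how the reliability curve was computed, so the sketch assumes that the flip rate is the share of trials disagreeing with the majority verdict and that trials are independent with a fixed agreement probability. The function names `flip_rate`, `majority_recovery_prob`, and `trials_needed` are hypothetical.

```python
from math import comb


def flip_rate(verdicts):
    """Share of trials that disagree with the majority verdict (assumed definition)."""
    counts = {}
    for v in verdicts:
        counts[v] = counts.get(v, 0) + 1
    return 1 - max(counts.values()) / len(verdicts)


def majority_recovery_prob(p_agree, k):
    """Probability that the majority of k independent trials matches the reference.

    p_agree is the per-trial chance of agreeing with the reference verdict.
    k must be odd so that a two-way vote cannot tie.
    """
    if k < 1 or k % 2 == 0:
        raise ValueError("k must be a positive odd integer")
    # Sum the binomial tail: more than half of the k trials agree.
    return sum(
        comb(k, i) * p_agree**i * (1 - p_agree) ** (k - i)
        for i in range(k // 2 + 1, k + 1)
    )


def trials_needed(p_agree, target=0.95, max_k=99):
    """Smallest odd k whose majority vote reaches the target, or None if none does."""
    for k in range(1, max_k + 1, 2):
        if majority_recovery_prob(p_agree, k) >= target:
            return k
    return None


# Illustrative input, not data from the study: 40 "A" and 10 "B" verdicts.
rate = flip_rate(["A"] * 40 + ["B"] * 10)
print(round(rate, 2), trials_needed(1 - rate))
```

Under this idealized model, a question whose trials agree with the reference 80% of the time needs 7 trials to reach the 95% target, while a question at 50% agreement never reaches it however many trials are run. An empirical curve built from real trials can differ, since questions vary widely in their flip rates and the independence assumption may not hold.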
Blogger's Review: This study highlights the limitations of LLMs as judges, particularly in high-stakes scenarios. The findings indicate that a single-trial judgment can be unreliable, and that multi-trial aggregation and position randomization should be adopted to make evaluations more reliable. Future research should treat cross-provider validation as a crucial next step.