[CS.AI] Bayesian Inference and Decision Audits in Frontie...

Abstract

Public AI evaluations are often read as terminal leaderboards, yet the underlying evidence is a selective time series shaped by reporting rules, benchmark revisions, and missingness. Repeated public archives for LiveBench and Open LLM Leaderboard v2 serve as the primary longitudinal record; LMArena provides a preference stress test; and GAIA and tau-bench contribute limited agentic pilots. Together, these archives instantiate a Bayesian inference problem: under a fixed reporting convention, one constructed terminal-only example over $1{,}000$ systems is compatible with two pre-terminal histories, yielding times of $23.03$ or $75.13$ to reach within $0.05$ of the ceiling under the same terminal-tail model. In synthetic posterior comparisons, action-facing diagnostics differ across observation regimes. The candidate selection-aware frontier model fails synthetic recovery, objective-archive prediction, preference transfer, and uncertainty calibration; correspondingly, fixed audit gates reject its stronger claims. An archive-and-adjudication protocol reconstructs public evaluation histories, isolates a verified timing boundary, and falsifies unsupported frontier claims.

Blogger's Review: This paper frames public AI evaluation as a Bayesian inference problem and shows that selective reporting, benchmark revisions, and missing data can change how the same results are interpreted. Its archive-and-adjudication protocol gives future evaluations a concrete way to reconstruct public evaluation histories and to reject frontier claims that the evidence does not support.
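
To make the abstract's terminal-only ambiguity concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's model: it assumes the gap to the ceiling shrinks exponentially, gap(t) = terminal_gap * exp(-rate * t), and that the closure rate is read off a pre-terminal history. The function names and all numeric values are hypothetical, and the sketch does not attempt to reproduce the $23.03$ and $75.13$ figures quoted above.

```python
import math


def time_to_ceiling(terminal_gap: float, rate: float, eps: float = 0.05) -> float:
    """Time for an exponentially shrinking gap to fall to eps.

    Assumes gap(t) = terminal_gap * exp(-rate * t), a hypothetical tail model.
    """
    if terminal_gap <= eps:
        return 0.0  # already within eps of the ceiling
    if rate <= 0:
        return math.inf  # a gap that is not closing never reaches the ceiling
    return math.log(terminal_gap / eps) / rate


def rate_from_history(gap_start: float, gap_end: float, elapsed: float) -> float:
    """Closure rate implied by two gap observations `elapsed` time units apart."""
    return math.log(gap_start / gap_end) / elapsed


# Two hypothetical pre-terminal histories that end at the same terminal gap.
terminal_gap = 0.20
fast_rate = rate_from_history(gap_start=0.80, gap_end=terminal_gap, elapsed=10.0)
slow_rate = rate_from_history(gap_start=0.30, gap_end=terminal_gap, elapsed=10.0)

# Same terminal observation, same tail model, different projected times.
fast_time = time_to_ceiling(terminal_gap, fast_rate)
slow_time = time_to_ceiling(terminal_gap, slow_rate)
print(f"fast history: rate={fast_rate:.4f}, time={fast_time:.2f}")
print(f"slow history: rate={slow_rate:.4f}, time={slow_time:.2f}")
```

Both histories share the same terminal value, so a reader who sees only the terminal leaderboard cannot tell them apart, yet the projected time to come within 0.05 of the ceiling differs by more than a factor of three in this toy setup.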

[CS.AI] Bayesian Inference and Decision Audits in Frontier AI Evaluations

Abstract