[CS.AI] Inference Compute Shapes Frontier LLM Evaluation

As AI evaluations shift toward harder tasks involving tool use and iterative problem solving, performance becomes increasingly sensitive to the amount and allocation of compute available at test time, referred to as "inference compute." Many evaluations still report performance at a single restrictive budget, meaning low scores may reflect the evaluation setup rather than the model's underlying capability. To investigate this, we evaluate up to 12 frontier language models on seven challenging benchmarks across software engineering, mathematics, medicine, and cybersecurity.

We use a controlled setup that combines three simple inference-scaling interventions: larger token budgets, context compaction, and repeated submission attempts guided either by the model itself or by minimal correctness feedback. The evaluation yields three main findings.

First, larger token budgets significantly improve performance across multiple domains and benchmarks, including cybersecurity tasks, FrontierMath, Humanity's Last Exam, and TerminalBench.

Second, fixed-budget evaluations increasingly understate frontier capability as models advance. Newer models achieve higher performance at large budgets, unlocking harder tasks and solving them more reliably.

Third, benchmarks vary in which inference-scaling methods are most effective: repeated submissions broadly enhance performance, while the value of larger token budgets, external feedback, and parallel attempts varies by benchmark.
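To make two of the interventions concrete, here is a minimal sketch of repeated submission attempts with minimal correctness feedback under a total token budget. It is an illustration under stated assumptions, not the paper's harness: `AttemptResult`, `run_with_budget`, and `make_stub_model` are hypothetical names, the "model" is a stub that replays canned answers, and context compaction is left out entirely.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class AttemptResult:
    answer: str
    tokens_used: int


def run_with_budget(
    attempt_fn: Callable[[Optional[str]], AttemptResult],
    is_correct: Callable[[str], bool],
    token_budget: int,
    max_attempts: int,
) -> Dict[str, object]:
    """Retry until the answer is correct, attempts run out, or the budget is exceeded."""
    tokens_spent = 0
    feedback: Optional[str] = None  # nothing to report before the first attempt
    for attempt in range(1, max_attempts + 1):
        result = attempt_fn(feedback)
        tokens_spent += result.tokens_used
        if tokens_spent > token_budget:
            # The attempt overran the budget, so its answer does not count.
            return {"solved": False, "attempts": attempt, "tokens": tokens_spent}
        if is_correct(result.answer):
            return {"solved": True, "attempts": attempt, "tokens": tokens_spent}
        feedback = "incorrect"  # minimal feedback: only right or wrong
    return {"solved": False, "attempts": max_attempts, "tokens": tokens_spent}


def make_stub_model(answers: List[str], tokens_per_attempt: int):
    """Hypothetical stand-in for a model: replays canned answers in order."""
    remaining = iter(answers)

    def attempt_fn(feedback: Optional[str]) -> AttemptResult:
        return AttemptResult(answer=next(remaining), tokens_used=tokens_per_attempt)

    return attempt_fn


def check(answer: str) -> bool:
    return answer == "42"


# Same stub model, same task, two budgets (all numbers are made up).
small = run_with_budget(make_stub_model(["a", "b", "42"], 100), check, 150, 5)
large = run_with_budget(make_stub_model(["a", "b", "42"], 100), check, 1000, 5)
print(small["solved"], large["solved"])
```

In this toy run the same stub fails at the 150-token budget and succeeds at the 1000-token budget, which is the kind of protocol dependence the findings describe.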

Overall, our results indicate that benchmark scores are protocol-dependent. We argue that evaluations should report capability as a function of inference-time compute, explicitly specify protocol choices, and compare model generations over a large shared compute range at matched budgets, especially in safety- or policy-relevant contexts.
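One way to picture "capability as a function of inference-time compute" is a solve-rate curve evaluated at the same budgets for every model. The sketch below assumes we already know, for each task, the fewest tokens a model needed to solve it (or `None` if it never did). The function name `solve_rate_curve` and all numbers are hypothetical and are not taken from the paper.

```python
from typing import List, Optional, Sequence


def solve_rate_curve(
    tokens_to_solve: Sequence[Optional[int]],
    budgets: Sequence[int],
) -> List[float]:
    """Fraction of tasks solved within each token budget."""
    if not tokens_to_solve:
        raise ValueError("need at least one task")
    n_tasks = len(tokens_to_solve)
    return [
        sum(1 for t in tokens_to_solve if t is not None and t <= b) / n_tasks
        for b in budgets
    ]


# Made-up per-task costs for two model generations on the same four tasks.
older_model = [1000, 5000, None, None]
newer_model = [2000, 8000, 40000, None]
budgets = [4000, 10000, 50000]  # matched budgets shared by both models

for name, costs in [("older", older_model), ("newer", newer_model)]:
    print(name, solve_rate_curve(costs, budgets))
```

With these illustrative numbers the two models tie at the two smaller budgets and separate only at the largest one, so a single restrictive budget would hide the difference between generations.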

Blogger's Review: This article highlights how strongly inference compute affects the measured performance of large language models, and it argues that assessments should report results across a range of compute budgets rather than at a single setting. As models advance, fixed-budget evaluations increasingly fail to capture their full capability, so evaluation standards need to be updated accordingly.