[CS.AI] Revolutionizing Evaluation: Preference-Based Trajectory Assessment

Abstract

Offline evaluation of agentic systems often collapses trajectories to terminal success. This discards information about partial progress and induces widespread ties, which creates substantial statistical inefficiency: the effective sample size shrinks and it becomes harder to distinguish systems. We propose preference-based trajectory evaluation, which compares trajectories directly through temporal preferences over progress and time-to-return profiles.

We find that, across diverse agentic and interactive benchmarks, standard success-based metrics produce tied comparisons on roughly 75% of instances, whereas trajectory-aware preferences reduce ties to roughly 35%, improving discriminative power, ranking stability, and data efficiency. Our results suggest that benchmark saturation, often attributed to poor data collection or problem difficulty, may also be explained by the choice of evaluation measure.
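To make the tie-rate comparison concrete, here is a minimal sketch in Python. It is not the paper's method: the abstract does not define the preference relation, so the `Trajectory` class, the per-step progress scores, and the rule "higher peak progress wins, then earlier arrival at that peak" are all illustrative assumptions, and the four example pairs are toy data chosen only to show the mechanics.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class Trajectory:
    # Hypothetical per-step progress scores in [0, 1]; 1.0 means the task is solved.
    progress: list[float]

    @property
    def success(self) -> bool:
        return self.progress[-1] >= 1.0


def compare_success(a: Trajectory, b: Trajectory) -> int:
    """Terminal-success comparison: 1 if a wins, -1 if b wins, 0 on a tie."""
    return int(a.success) - int(b.success)


def compare_trajectory(a: Trajectory, b: Trajectory) -> int:
    """Assumed preference rule: higher peak progress wins; if equal,
    the trajectory that reaches its peak earlier wins."""
    peak_a, peak_b = max(a.progress), max(b.progress)
    if peak_a != peak_b:
        return 1 if peak_a > peak_b else -1
    step_a, step_b = a.progress.index(peak_a), b.progress.index(peak_b)
    if step_a != step_b:
        return 1 if step_a < step_b else -1
    return 0


def tie_rate(pairs: list[tuple[Trajectory, Trajectory]],
             compare: Callable[[Trajectory, Trajectory], int]) -> float:
    """Fraction of paired instances on which the comparison is a tie."""
    ties = sum(1 for a, b in pairs if compare(a, b) == 0)
    return ties / len(pairs)


# Toy data: one (system A, system B) trajectory pair per benchmark instance.
pairs = [
    (Trajectory([0.2, 0.5, 1.0]), Trajectory([0.5, 1.0, 1.0])),  # both succeed, B is faster
    (Trajectory([0.0, 0.3, 0.6]), Trajectory([0.1, 0.2, 0.2])),  # both fail, A gets further
    (Trajectory([0.0, 0.5, 1.0]), Trajectory([0.0, 0.5, 1.0])),  # identical trajectories
    (Trajectory([0.5, 1.0]), Trajectory([0.2, 0.4])),            # only A succeeds
]

success_ties = tie_rate(pairs, compare_success)
trajectory_ties = tie_rate(pairs, compare_trajectory)
print(f"success-based tie rate:    {success_ties:.2f}")
print(f"trajectory-aware tie rate: {trajectory_ties:.2f}")
```

On this toy data, the success-only comparison ties on three of the four pairs, while the assumed trajectory rule ties only on the identical pair. The numbers are an artifact of the hand-picked examples and say nothing about the benchmarks in the paper; the point is only that the same trajectories can yield far fewer ties once partial progress and timing are taken into account.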

Blogger's Review: Preference-based trajectory evaluation targets the statistical inefficiency of success-only evaluation. By retaining information about partial progress, it reduces the reported share of tied comparisons from roughly 75% to roughly 35%, which makes comparisons between systems more discriminative. The approach offers a fresh perspective on performance assessment for agentic systems and merits further exploration in future research.

[CS.AI] Revolutionizing Evaluation: Preference-Based Trajectory Assessment

Abstract