[CS.AI] Closing the Reflection Gap: Free Calibration Bonus for Agentic RL

Published at: 2026-06-15 22:00 Last updated: 2026-06-16 12:14
#AI #Machine Learning #Reinforcement Learning

Abstract

Large language models (LLMs) are increasingly deployed as agents that interact with external environments and observe feedback such as execution results, error messages, and tool outputs. A well-functioning agent should use this feedback to assess its own performance accurately. However, we find a persistent reflection gap: LLM agents tend to mis-assess their outputs after observing concrete environment feedback, even for questions they answered correctly. Standard RL barely helps due to a credit-assignment mismatch.
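One way to make the reflection gap concrete is to measure how often an agent flags its own correct answers as wrong. The sketch below is illustrative only: the function name `underconfidence_rate` and the definition (share of correct rollouts whose reflection says "incorrect") are assumptions, since the abstract does not spell out how the metric is computed.

```python
from typing import Sequence


def underconfidence_rate(outcomes: Sequence[bool], reflections: Sequence[bool]) -> float:
    """Fraction of actually-correct rollouts that the agent reflected on as incorrect.

    outcomes[i]    -- True if rollout i was actually correct (ground truth).
    reflections[i] -- True if the agent's reflection judged rollout i correct.
    Assumed definition for illustration; not taken from the paper.
    """
    if len(outcomes) != len(reflections):
        raise ValueError("outcomes and reflections must have the same length")
    # Keep only the reflections that belong to correct rollouts.
    on_correct = [r for o, r in zip(outcomes, reflections) if o]
    if not on_correct:
        return 0.0  # no correct rollouts, so nothing to be underconfident about
    return sum(1 for r in on_correct if not r) / len(on_correct)
```

Under this definition, an agent that answers three questions correctly but trusts only one of them has an underconfidence rate of two thirds.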

To close this gap, we propose RefGRPO, a simple yet effective fix that augments standard RL algorithms with two key ingredients: a free calibration bonus computed by contrasting the agent's own reflection with the actual outcome (requiring no additional reward model, LLM judge, or external annotation), and a dynamic schedule on the bonus coefficient. Compared to standard RL baselines, our method simultaneously improves reflection calibration (e.g., reducing the underconfidence rate from $44.4\%$ to $7.7\%$) and task accuracy (e.g., raising it from $75.1\%$ to $76.5\%$) on text-to-SQL across five benchmarks. The resulting calibrated reflection turns the agent into its own verifier, grounded in environment feedback. This further enables (i) better self-improvement that uses reflections as pseudo-rewards without outcome supervision, and (ii) more effective test-time selective prediction by committing only to rollouts the agent flags as correct.
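The two ingredients can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the abstract gives no formulas, so the agreement-based bonus (1 when the reflection matches the outcome, else 0), the linear ramp used as the "dynamic schedule", the default coefficient values, and all function names are hypothetical. The group-relative normalization follows the usual GRPO convention of standardizing rewards within a group of rollouts for the same prompt.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def calibration_bonus(reflection_says_correct: bool, outcome_correct: bool) -> float:
    # Assumed form: reward agreement between self-assessment and the real outcome.
    # Needs only the outcome signal RL already has, hence "free".
    return 1.0 if reflection_says_correct == outcome_correct else 0.0


def bonus_coefficient(step: int, total_steps: int, start: float = 0.0, end: float = 0.5) -> float:
    # Hypothetical schedule: linear ramp from `start` to `end` over training.
    # The paper's actual schedule may differ in shape and direction.
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac


def shaped_rewards(outcomes: Sequence[bool], reflections: Sequence[bool],
                   step: int, total_steps: int) -> List[float]:
    # Task reward (1 for a correct rollout) plus the scheduled calibration bonus.
    coef = bonus_coefficient(step, total_steps)
    return [float(o) + coef * calibration_bonus(r, o)
            for o, r in zip(outcomes, reflections)]


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    # GRPO-style advantage: standardize each reward within its rollout group.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With this shaping, a correct rollout whose reflection also says "correct" earns more than a correct rollout the agent doubts, so the policy receives a direct learning signal on the reflection itself rather than only on the final answer.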

Blogger's Review: RefGRPO narrows the reflection gap for LLM agents, sharply improving their self-assessment while also raising task accuracy. The approach offers a new angle on RL for agents and lays groundwork for self-improvement without outcome supervision, and it merits further exploration and application.

Original Source: https://arxiv.org/abs/2606.14211
