Abstract
Large language models (LLMs) are increasingly deployed as agents that interact with external environments and observe feedback such as execution results, error messages, and tool outputs. A well-functioning agent should use this feedback to assess its own performance accurately. However, we find a persistent reflection gap: LLM agents tend to mis-assess their outputs even after observing concrete environment feedback, including on questions they answered correctly. Standard RL barely helps because of a credit-assignment mismatch.
To close this gap, we propose RefGRPO, a simple yet effective fix that augments standard RL algorithms with two key ingredients: a free calibration bonus computed by contrasting the agent's own reflection with the actual outcome (requiring no additional reward model, LLM judge, or external annotation), and a dynamic schedule on the bonus coefficient. Compared to standard RL baselines, our method simultaneously improves reflection calibration (e.g., reducing the underconfidence rate from $44.4\%$ to $7.7\%$) and task accuracy (e.g., from $75.1\%$ to $76.5\%$) on text-to-SQL across five benchmarks. The resulting calibrated reflection turns the agent into its own verifier grounded in environment feedback, which further enables (i) better self-improvement that uses reflections as pseudo-rewards without outcome supervision, and (ii) more effective test-time selective prediction that commits only to rollouts the agent flags as correct.
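To make the two ingredients concrete, here is a minimal Python sketch of how a calibration bonus and a scheduled coefficient could be combined into a shaped reward. The abstract does not give the exact formulas, so everything here is an assumption made for illustration: the `Rollout` type, the 0/1 agreement bonus, the linear ramp in `coefficient` and its `start`/`end` values, and the definition of underconfidence as "correct answers the agent flags as incorrect". It is not the paper's implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    """Hypothetical record of one agent attempt (names are illustrative)."""
    correct: bool            # actual outcome reported by the environment
    reflected_correct: bool  # agent's own verdict after seeing the feedback


def calibration_bonus(r: Rollout) -> float:
    # Assumed form: 1 when the reflection agrees with the outcome, else 0.
    # It needs only the outcome signal that RL training already has.
    return 1.0 if r.reflected_correct == r.correct else 0.0


def coefficient(step: int, total_steps: int,
                start: float = 0.0, end: float = 0.5) -> float:
    # Assumed schedule: a linear ramp from `start` to `end` over training.
    # The actual schedule used by RefGRPO is not stated in the abstract.
    frac = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return start + (end - start) * frac


def shaped_reward(r: Rollout, step: int, total_steps: int) -> float:
    # Task reward plus the scheduled calibration bonus.
    task_reward = 1.0 if r.correct else 0.0
    return task_reward + coefficient(step, total_steps) * calibration_bonus(r)


def underconfidence_rate(rollouts: List[Rollout]) -> float:
    # Assumed definition: share of correct rollouts flagged as incorrect.
    correct = [r for r in rollouts if r.correct]
    if not correct:
        return 0.0
    return sum(not r.reflected_correct for r in correct) / len(correct)


def select_confident(rollouts: List[Rollout]) -> List[Rollout]:
    # Test-time selective prediction: keep only rollouts flagged as correct.
    return [r for r in rollouts if r.reflected_correct]
```

Under these assumptions, a correct rollout that the agent wrongly doubts earns only the task reward, while a correctly self-assessed one earns the extra bonus, which is what pushes the policy toward calibrated reflection. The same `reflected_correct` flag is what `select_confident` relies on at test time.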
Blogger's Review: RefGRPO narrows the reflection gap for LLM agents, improving both their self-assessment and their task accuracy. The approach offers a new angle on RL for agents and lays groundwork for self-improvement without outcome supervision, and it merits further exploration and application.