[CS.AI] First-Principles Derivation of LLM Policy Optimization

Abstract

Policy gradient algorithms for language models optimize the same objective $J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}[R(\tau)]$, which consists of two factors: trajectory probability $p_\theta(\tau)$ and reward $R(\tau)$. Each method from REINFORCE to PPO to GRPO modifies one or both factors to address specific failures in previous formulations. Existing surveys organize these methods by domain or chronology, obscuring the rationale behind each design choice and the precise location of its intervention within the gradient estimator.
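To make the two factors concrete, the standard score-function (REINFORCE) identity can be written out as follows. This is a textbook derivation and not necessarily the survey's own notation: it assumes a discrete trajectory space, a reward $R(\tau)$ that does not depend on $\theta$, and a token-level policy $\pi_\theta(a_t \mid s_t)$ whose symbols are introduced here only for illustration.

$$
\begin{aligned}
\nabla_\theta J(\theta)
  &= \nabla_\theta \sum_{\tau} p_\theta(\tau)\, R(\tau)
   = \sum_{\tau} p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau) \\
  &= \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[ R(\tau)\, \nabla_\theta \log p_\theta(\tau) \big],
  \qquad
  \nabla_\theta \log p_\theta(\tau) = \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t).
\end{aligned}
$$

The estimator therefore has one term that comes from $p_\theta(\tau)$ (which trajectories are sampled and how they are weighted) and one that comes from $R(\tau)$ (the scalar that multiplies the score), which is where the trajectory side and the reward side enter.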

This survey revisits the landscape of LLM policy optimization from first principles using $J(\theta)$, employing the trajectory side induced by $p_\theta(\tau)$ and the reward side induced by $R(\tau)$ as the axes on which methods are located. It covers the evolution from REINFORCE and PPO to GRPO, as well as post-GRPO variants, Agentic RL, and GRPO-OPD. The resulting framework is unified, diagnostic, and extensible: it analyzes methods from a shared objective, identifies which side each method modifies and why, and applies the same trajectory and reward axes across these settings.
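As a rough illustration of what modifying each side can look like, the sketch below pairs a GRPO-style group-normalized advantage (a reward-side change) with a PPO-style clipped importance ratio (a trajectory-side change). These are the commonly cited formulations, not code from the survey; the function names, the `eps` and `clip_eps` defaults, and the toy numbers are all assumptions made for this example.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Reward side (GRPO-style): center and scale rewards within one group.

    `rewards` holds one scalar per sampled response to the same prompt.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Trajectory side (PPO-style): clipped importance-weighted objective.

    `logp_new` / `logp_old` are log-probabilities of the same samples under
    the current policy and the policy that generated them.
    """
    terms = []
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)
        clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Take the more pessimistic of the unclipped and clipped terms.
        terms.append(min(ratio * adv, clipped_ratio * adv))
    return sum(terms) / len(terms)


# Toy group of four responses to one prompt (hypothetical numbers).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

logp_old = [-2.0, -2.0, -2.0, -2.0]
logp_new = [-1.5, -2.0, -2.5, -2.0]
objective = clipped_surrogate(logp_new, logp_old, advantages)
```

In this toy group the first response has a positive advantage and a ratio above the clip range, so its contribution is capped; the third has a negative advantage and a ratio below the range, so the clipped (more pessimistic) term is used. Swapping either function for another variant changes one side of the estimator while leaving the other untouched.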

Across these settings, the framework also exposes compound failures that no single-side fix resolves, requiring joint design of the trajectory side and the reward side. The boundary cases and coupled failures identified by this map mark where existing solutions run out and provide a principled starting point for designing the next generation of LLM policy optimization algorithms.

Blogger's Review: This paper dives deep into the complexities of LLM policy optimization from first principles, clearly illustrating the interplay between trajectory and reward dimensions. This framework not only aids in understanding the limitations of current methods but also provides valuable guidance for future algorithm design.
