Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective post-training paradigm for enhancing the reasoning abilities of large language models. However, existing RLVR methods typically rely on final-answer correctness to assign trajectory-level rewards, which yields sparse supervision and treats all tokens uniformly regardless of their actual contribution to reasoning. Recent studies have introduced intermediate signals such as process rewards, high-entropy tokens, and semantic uncertainty, but these signals are often not inherently verifiable and may fail to distinguish beneficial strategic patterns from harmful ones.
To address this limitation, we propose STRIDE (Strategic Trajectory Reasoning with Discriminative Estimation), a fine-grained RLVR framework that derives strategic reasoning supervision from verifiable outcomes. STRIDE contrasts successful and failed trajectories within each response group to estimate the outcome-discriminative preference of each $n$-gram strategic pattern, and further combines this signal with reasoning saliency entropy to identify decision-relevant strategic patterns. These patterns are assigned differentiated advantage values during RL optimization, enabling more precise credit assignment while preserving the verifiability of RLVR. Extensive experiments demonstrate that STRIDE consistently improves reasoning performance across diverse models, tasks, and extended settings, including VLMs and agent-based systems.
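To make the contrastive step concrete, here is a minimal Python sketch of how $n$-gram patterns might be scored by comparing successful and failed trajectories within one response group. It is an illustration under stated assumptions, not the paper's estimator: the function names (`ngrams`, `contrast_scores`), the whitespace-style token lists, and the score itself (the fraction of successful trajectories containing a pattern minus the fraction of failed ones) are all hypothetical choices. The reasoning saliency entropy term and the mapping from scores to advantage values are left out because the abstract does not define them.

```python
from collections import Counter
from typing import Dict, List, Sequence, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(tokens: Sequence[str], n: int) -> Set[NGram]:
    """Return the set of distinct n-grams that occur in one trajectory."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contrast_scores(
    trajectories: List[Sequence[str]],
    rewards: List[float],
    n: int = 2,
) -> Dict[NGram, float]:
    """Score each n-gram by how much more often it appears in successes.

    Assumed score (not taken from the paper):
        share of successful trajectories containing the n-gram
        minus share of failed trajectories containing it.
    The result lies in [-1, 1]; positive means success-associated.
    """
    succ = [t for t, r in zip(trajectories, rewards) if r > 0]
    fail = [t for t, r in zip(trajectories, rewards) if r <= 0]
    # With only successes or only failures there is nothing to contrast.
    if not succ or not fail:
        return {}

    # Count, per n-gram, how many trajectories of each kind contain it.
    succ_df = Counter(g for t in succ for g in ngrams(t, n))
    fail_df = Counter(g for t in fail for g in ngrams(t, n))

    return {
        g: succ_df[g] / len(succ) - fail_df[g] / len(fail)
        for g in set(succ_df) | set(fail_df)
    }


# Toy response group: two verified-correct and two incorrect trajectories.
group = [
    ["let", "us", "verify", "the", "result"],
    ["let", "us", "verify", "again"],
    ["let", "us", "guess", "the", "answer"],
    ["just", "guess", "the", "answer"],
]
group_rewards = [1.0, 1.0, 0.0, 0.0]
scores = contrast_scores(group, group_rewards, n=2)
```

In this toy group, a pattern that appears only in successful trajectories gets a positive score, a pattern that appears only in failed ones gets a negative score, and a pattern shared by both kinds lands in between. A fine-grained method could use such a signal to weight tokens differently while still relying only on verifiable outcomes.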
Blogger's Review: STRIDE's core idea is to contrast successful and failed trajectories within each response group, which gives finer-grained credit assignment while keeping the reward signal verifiable. The abstract reports consistent gains across models and tasks, including VLMs and agent-based systems, so the method merits attention and further exploration.