Existing RL post-training methods for text-to-image generation usually convert the final-image reward into a single scalar advantage and apply it uniformly across the entire generative trajectory. However, text-to-image generation has inherent temporal and spatial structure: different denoising steps are responsible for different stages of generation, and the content that determines text alignment often appears in only part of the image. This granularity mismatch makes it hard to concentrate policy updates on the generative components that actually affect the reward.
To address this issue, we propose SpatioTemporal Adaptive Reward (STAR) Allocation for RL post-training of text-to-image diffusion and flow models. Starting from the core content in the prompt that users actually care about, STAR uses the generative model's own text-image attention to construct spatial allocation maps that vary dynamically across denoising steps and rollouts. These maps allocate the same group-relative advantage to the more relevant latent regions with almost no additional computational overhead, and a spatially resolved policy objective then applies stronger policy updates to those regions.
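The allocation idea can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the `floor` mixing parameter, and the choice to normalize the map to mean 1 (so the average advantage over positions stays equal to the original scalar) are all assumptions made for illustration. The attention scores are taken as given, one non-negative value per latent position.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize final-image rewards within one group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def allocation_map(attention, floor=0.5):
    """Map non-negative attention scores to per-position weights with mean 1.

    `floor` is a hypothetical knob: the share of weight spread uniformly,
    so that no latent position is left with a zero update.
    """
    n = len(attention)
    total = sum(attention)
    if total <= 0:
        # No usable attention signal: fall back to uniform allocation.
        return [1.0] * n
    return [floor + (1.0 - floor) * a * n / total for a in attention]


def spatial_advantages(advantage, attention, floor=0.5):
    """Spread one scalar advantage over latent positions using the map."""
    return [advantage * w for w in allocation_map(attention, floor)]


# Toy example: four rollouts in a group, four latent positions.
rewards = [0.2, 0.8, 0.5, 0.9]
advantages = group_relative_advantages(rewards)
attention = [0.0, 0.1, 0.6, 0.3]  # assumed attention to the prompt's key tokens
per_position = spatial_advantages(advantages[1], attention)
```

In this sketch the map would be recomputed for every denoising step and every rollout, and the per-position advantages would replace the single scalar in the policy objective. Positions with more attention receive a proportionally larger share of the same advantage.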
We use Stable Diffusion 3.5 Medium as the base model and evaluate on three tasks: GenEval, OCR text rendering, and PickScore. Experimental results show that STAR improves compositional semantic alignment, text rendering, and preference optimization without changing the external reward source, achieving $\textbf{0.9759}$, $\textbf{0.9757}$, and $\textbf{23.60}$ on GenEval, OCR, and PickScore, respectively.
Blogger's Review: STAR introduces a spatio-temporally adaptive mechanism for the reward allocation problem in RL post-training of text-to-image models, concentrating policy updates on the image regions that most influence the reward. Its strong results on GenEval, OCR text rendering, and PickScore offer new insights for further research in text-to-image generation.