[CS.AI] Uncovering Reward Exploitability in Code RL Environments

In this study, we measure how often code reinforcement learning (RL) environments accept incorrect solutions as correct. In a sample of 49 tasks from SWE-bench Verified, 28.5% had test suites weak enough that a Docker-verified incorrect patch could pass. On 20 tasks from R2E-Gym, the same pipeline showed a 25.0% exploit generation rate. A random-effects meta-analysis over 134 frontier model submissions to SWE-bench Verified indicates that, within the same human-rated difficulty stratum, model Pass@1 is 14.14 percentage points higher on flagged-hackable tasks than on robust ones (95% CI [+11.80, +16.48]; one-sided p-value).
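To make "a test suite weak enough that an incorrect patch could pass" concrete, here is a minimal toy sketch. It is not the study's pipeline: the median task, the function names (`reference_median`, `incorrect_median`, `weak_suite`, `stronger_suite`), and the test inputs are all hypothetical, chosen only to show how a sparse suite can reward a wrong solution.

```python
def reference_median(values):
    """Correct solution: sort first, then average the middle pair if needed."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def incorrect_median(values):
    """Incorrect patch (hypothetical): never sorts, ignores even-length input."""
    return values[len(values) // 2]


def weak_suite(fn):
    """Weak test suite: checks only one already-sorted, odd-length input."""
    return fn([1, 2, 3]) == 2


def stronger_suite(fn):
    """Stronger suite: adds an unsorted input and an even-length input."""
    return (
        weak_suite(fn)
        and fn([3, 1, 2]) == 2
        and fn([4, 1, 3, 2]) == 2.5
    )


if __name__ == "__main__":
    for name, fn in [("reference", reference_median),
                     ("incorrect", incorrect_median)]:
        print(name, "weak:", weak_suite(fn), "stronger:", stronger_suite(fn))
```

In this toy setting the weak suite accepts both functions, so an RL reward computed from it cannot tell the correct solution from the incorrect one. The stronger suite accepts only the reference.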

Blogger's Review: This research exposes weaknesses in the test suites of code RL environments, showing that a substantial share of them accept incorrect solutions as correct. The findings matter for building more robust RL systems and underline the importance of code validation in future research.
