In vision-language models (VLMs), verifier-driven self-DPO (Direct Preference Optimization) is a common approach to self-improvement. In this setup, a frozen verifier scores candidate generations, the top- and bottom-scoring candidates form a preference pair, and DPO updates the learner on that pair. The deployment-time assumption is monotonic: a stronger verifier should yield a stronger student. We demonstrate that this assumption can fail because verifier quality is task-specific.
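The pair-construction step in this setup can be sketched as follows. This is a minimal illustration, not the authors' code: `build_preference_pair`, `score_fn`, and `min_margin` are hypothetical names, and the verifier is assumed to be any callable that returns one scalar score per candidate.

```python
from typing import Callable, Dict, List, Optional


def build_preference_pair(
    prompt: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],  # hypothetical frozen verifier
    min_margin: float = 0.0,                # hypothetical tie filter
) -> Optional[Dict[str, str]]:
    """Return a DPO-style preference pair from verifier scores.

    The top-scoring candidate becomes "chosen" and the bottom-scoring
    candidate becomes "rejected". Returns None when there are fewer than
    two candidates or when the score gap does not exceed min_margin.
    """
    if len(candidates) < 2:
        return None

    # Score every candidate once with the frozen verifier.
    scored = [(score_fn(prompt, c), c) for c in candidates]
    best_score, best = max(scored, key=lambda item: item[0])
    worst_score, worst = min(scored, key=lambda item: item[0])

    # A pair with no score gap carries no preference signal.
    if best_score - worst_score <= min_margin:
        return None

    return {"prompt": prompt, "chosen": best, "rejected": worst}
```

Note that nothing in this step checks whether the verifier's ranking is correct: if the verifier is wrong on the target task, the "chosen" and "rejected" labels are simply swapped and the learner is trained toward the worse answer.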
On a four-rung open-source verifier ladder across MathVista, MMMU, and BLINK, the same verifiers that are above-threshold and improve a Qwen-3-VL-2B student on MathVista become sub-threshold on MMMU, where their task-rubric accuracy drops to between 8% and 23%. In this regime, every verifier we tested silently regressed the student, resulting in drops of 3.4 to 10.9 percentage points below the frozen baseline, even as the DPO training loss continued to decrease. This regression replicated on a second student, Qwen-2.5-VL-3B.
Moreover, within the failure regime, damage is confidence-inverted: the more accurate-but-still-wrong verifier causes larger regression than a near-random verifier, suggesting that progress-gated replay amplifies confidently wrong preference pairs. We provide a compact mechanistic explanation via a variance theorem for progress-gated replay and its direction-mismatch failure mode. The deployment message is operational rather than purely diagnostic: before running any verifier-driven loop, teams should measure target-task rubric accuracy, rank verifiers by target-task rubric quality rather than parameter count, and treat diminishing returns in above-threshold regimes as a verifier-side compute budget cap.
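The pre-flight check recommended above (measure target-task rubric accuracy, then rank verifiers by it rather than by size) could look roughly like this. It is a sketch under stated assumptions: all names are hypothetical, verifier verdicts are assumed to be binary judgments compared against gold rubric labels, and `threshold` is a placeholder that must be calibrated, since the text here gives no numeric cutoff.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class VerifierReport:
    name: str
    params_b: float         # parameter count in billions, reported only
    rubric_accuracy: float  # agreement with gold rubric labels on the target task
    above_threshold: bool


def rubric_accuracy(verdicts: Sequence[bool], gold: Sequence[bool]) -> float:
    """Fraction of held-out target-task items where the verifier matches gold."""
    if len(gold) == 0 or len(verdicts) != len(gold):
        raise ValueError("verdicts and gold must be non-empty and equal length")
    return sum(v == g for v, g in zip(verdicts, gold)) / len(gold)


def rank_verifiers(
    verdicts_by_verifier: Dict[str, Sequence[bool]],
    params_b: Dict[str, float],
    gold: Sequence[bool],
    threshold: float,  # assumption: calibrate per task; not given in the text
) -> List[VerifierReport]:
    """Rank verifiers by measured target-task accuracy, ignoring size."""
    reports = []
    for name, verdicts in verdicts_by_verifier.items():
        acc = rubric_accuracy(verdicts, gold)
        reports.append(VerifierReport(name, params_b[name], acc, acc >= threshold))
    # Sort by accuracy only; parameter count never enters the ordering.
    return sorted(reports, key=lambda r: r.rubric_accuracy, reverse=True)
```

A loop would then only be launched with verifiers whose `above_threshold` flag is true on the target task, regardless of how they performed on a different benchmark.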
Blogger's Review: This article shows how strongly verifier quality shapes self-improving models, and why verifiers must be assessed on the specific target task. As VLMs become widely deployed, teams should judge a verifier by its measured accuracy on that task rather than by parameter count. A sound verifier selection and evaluation strategy is crucial to keeping a self-improvement loop from silently degrading the student.