Spatial Vision Language Models (VLMs) have made significant strides in geometric perception, yet complex spatial reasoning involving depth, distance, and scene relations remains challenging. Different spatial queries require fundamentally different strategies: some are best addressed through purely linguistic, step-by-step deduction, while others necessitate explicit 3D grounding before quantitative inference. We introduce Dual-Path Spatial Reasoning via Reinforcement Learning for Spatial VLMs (SR-REAL), a unified framework that equips a spatial VLM with two complementary reasoning paths: Language-Only Reasoning (LOR), which performs step-by-step linguistic deduction, and Detect-Then-Reason (DTR), which detects 3D geometric cues (e.g., centers or bounding boxes) via region tokens before explicit geometric inference.
SR-REAL begins with a cold-start supervised fine-tuning stage that constructs LOR and DTR chain-of-thought supervision and exposes a region-to-3D interface. An RL stage then optimizes the policy model with accuracy and format rewards; for DTR, a discrete center-based detection reward further refines geometric alignment. Across diverse spatial benchmarks, SR-REAL significantly outperforms spatial VLM baselines. The main findings are:
- A single RL-trained model supports both reasoning paths, with DTR excelling in region-aware tasks through precise 3D localization and LOR enhancing general spatial reasoning;
- Jointly training both paths fosters mutual reinforcement;
- High-quality, blended cold-start data is crucial for stable RL optimization;
- The model generalizes across datasets and domains without per-task tuning, demonstrating positive transfer between LOR and DTR.
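To make the reward design described above more concrete, here is a minimal sketch of how accuracy, format, and center-based detection rewards might be combined per reasoning path. It is not the authors' implementation: the `<think>`/`<answer>` template, the exact-match accuracy check, the equal weighting, the 0.5 m threshold, and all function names are assumptions chosen for illustration. In particular, "discrete" is interpreted here as a thresholded 0/1 reward on the distance between predicted and ground-truth 3D centers.

```python
import math
import re
from typing import Optional, Sequence

# Hypothetical output template (an assumption, not taken from the paper):
#   <think> reasoning </think><answer> final answer </answer>
_FORMAT_RE = re.compile(
    r"^\s*<think>.*?</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL
)


def format_reward(response: str) -> float:
    """1.0 if the response follows the assumed template, else 0.0."""
    return 1.0 if _FORMAT_RE.match(response) else 0.0


def extract_answer(response: str) -> Optional[str]:
    """Return the text inside <answer>...</answer>, or None if malformed."""
    match = _FORMAT_RE.match(response)
    return match.group(1).strip() if match else None


def accuracy_reward(response: str, target: str) -> float:
    """1.0 on a case-insensitive exact match with the target (assumed metric)."""
    answer = extract_answer(response)
    if answer is None:
        return 0.0
    return 1.0 if answer.lower() == target.strip().lower() else 0.0


def center_detection_reward(
    pred_center: Optional[Sequence[float]],
    gt_center: Sequence[float],
    threshold_m: float = 0.5,  # assumed tolerance in meters
) -> float:
    """Discrete reward: 1.0 if the predicted 3D center lies within the threshold."""
    if pred_center is None:
        return 0.0
    return 1.0 if math.dist(pred_center, gt_center) <= threshold_m else 0.0


def total_reward(
    response: str,
    target: str,
    path: str,  # "LOR" or "DTR"
    pred_center: Optional[Sequence[float]] = None,
    gt_center: Optional[Sequence[float]] = None,
) -> float:
    """Sum the rewards; the detection term applies only to the DTR path."""
    reward = accuracy_reward(response, target) + format_reward(response)
    if path == "DTR" and gt_center is not None:
        reward += center_detection_reward(pred_center, gt_center)
    return reward
```

Under these assumptions, a well-formed, correct LOR response scores 2.0, and a DTR response can earn one additional point when its predicted center falls within the tolerance. A real training setup would likely weight the terms and parse the predicted center from the model's region-token output, neither of which is specified in the summary above.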
Blogger's Review: SR-REAL offers an innovative approach to spatial reasoning that combines linguistic deduction with geometric detection in a single model. The dual-path design shows potential for improving performance on complex spatial tasks. Given the reported importance of cold-start data, future work could focus on improving data quality and further strengthening model generalization.