Abstract
Embodied reasoning requires models to perceive task-relevant objects and spaces in physical environments and to maintain consistent visual grounding throughout multi-step reasoning. However, current vision-language models rely on text-only or coordinate-augmented chain-of-thought, in which entity references remain implicit and ambiguous. As a result, the reasoning process may decouple from visual evidence, entity references may drift across steps, and the reasoning trajectory may become causally disconnected from the final answer. These problems are further amplified in multi-view scenarios, where an entity's appearance changes across views.
To address these issues, we propose Pinned Chain-of-Thought (\texttt{pincot}), a structured reasoning paradigm that pins every reasoning step to visual evidence. \texttt{pincot} introduces the concept of a \texttt{reasoninganchor}, which binds each task-relevant entity to a structured visual anchor comprising an entity name, a unique identity, a view index, and spatial grounding, enabling consistent entity tracking across reasoning steps and views.
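As a rough illustration of the anchor described above, the four fields could be held in a small record, with a check that an identity is never bound to two different entity names across steps. This is a minimal sketch, not the paper's format: the class name `ReasoningAnchor`, the field names, the bounding-box form of the spatial grounding, and the helper `identity_conflicts` are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class ReasoningAnchor:
    """Hypothetical container for one anchor; field names are illustrative."""
    name: str        # entity name, e.g. "mug"
    entity_id: int   # unique identity, reused across steps and views
    view: int        # index of the view the grounding refers to
    box: Tuple[float, float, float, float]  # assumed grounding: (x1, y1, x2, y2)


def identity_conflicts(steps: List[List[ReasoningAnchor]]) -> List[int]:
    """Return the entity ids that are bound to more than one name across steps."""
    names_by_id: Dict[int, Set[str]] = {}
    for anchors in steps:
        for anchor in anchors:
            names_by_id.setdefault(anchor.entity_id, set()).add(anchor.name)
    return sorted(eid for eid, names in names_by_id.items() if len(names) > 1)
```

A reasoning trace would then be a list of steps, each carrying the anchors it refers to. An empty result from `identity_conflicts` means no identity drifted to a different entity name anywhere in the trace.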
We build a fully automated data generation pipeline to construct \texttt{dataset}, a high-quality \texttt{pincot}-formatted reasoning dataset. We then train \texttt{method} through a three-stage post-training procedure that progressively injects embodied knowledge, structured reasoning ability, and process-supervised alignment, with rewards that directly constrain both anchor localization and identity consistency during reasoning.
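The abstract does not give the form of these rewards. Purely as an illustration of how a localization term and an identity term might be combined, the sketch below scores predicted anchors against reference boxes. Everything in it is an assumption made for this example rather than the paper's reward: the box-IoU localization score, the identity score, the equal weights, and the names `iou` and `anchor_reward`.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def anchor_reward(
    pred: List[Tuple[int, int, Box]],    # (entity_id, view, box) per predicted anchor
    gold: Dict[Tuple[int, int], Box],    # (entity_id, view) -> reference box
    w_loc: float = 0.5,                  # hypothetical weight on localization
    w_id: float = 0.5,                   # hypothetical weight on identity
) -> float:
    """Illustrative reward: weighted sum of mean IoU and identity hit rate."""
    if not pred:
        return 0.0
    known_ids = {eid for eid, _ in gold}
    # Localization: IoU against the reference box for the same entity and view.
    loc = sum(iou(box, gold[(eid, view)]) if (eid, view) in gold else 0.0
              for eid, view, box in pred) / len(pred)
    # Identity: share of predicted anchors whose id names a known entity.
    ident = sum(1 for eid, _, _ in pred if eid in known_ids) / len(pred)
    return w_loc * loc + w_id * ident
```

In this sketch, an anchor that is well localized but carries an unknown identity scores zero on both terms, because its box has no reference to be compared against.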
On 14 benchmarks covering embodied spatial reasoning, multi-view reasoning, and pointing, \texttt{method}, with only 4B parameters, consistently outperforms 7B-level open-source embodied models, achieving a 12\text{\textperthousand} average improvement over the strongest 7B baseline, Mimo-Embodied. Further analysis shows that \texttt{pincot} improves grounding accuracy and cross-step identity consistency, validating the effectiveness of process supervision.
Blogger's Review: The Pinned Chain-of-Thought introduced by RoboPIN offers a fresh perspective on embodied reasoning: it ties each reasoning step to visual evidence and so addresses the implicit, ambiguous entity references of text-only chain-of-thought. Its reported performance in multi-view scenarios is particularly noteworthy, and its potential in broader applications warrants attention.