Abstract
Multimodal large language models (MLLMs) have demonstrated impressive capabilities in many visual tasks, but they often struggle with factual grounding when confronted with complex, open-world scenarios. Recent multimodal deep search agents attempt to address this issue by utilizing external tools, yet the visual-native search paradigm remains underexplored. Existing methods primarily rely on simple images with explicit semantics and text-only evidence trajectories, limiting the agent's ability to perform multi-hop, cross-modal reasoning and search.
To address these limitations, we propose Visual-Seeker, a visual-native multimodal deep search agent built on active visual reasoning. Rather than treating vision as a static input, the agent actively attends to fine-grained visual details and dynamically harvests visual evidence throughout the search process. To unlock this visual-native potential, we design an active visual reasoning data pipeline and synthesize 5K high-quality multimodal trajectories for model training.
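To make the idea of interleaving visual and textual evidence more concrete, here is a minimal sketch of what such a search loop could look like. It is not the Visual-Seeker implementation: the tool names (`crop_tool`, `image_search_tool`), the `Step` and `Trajectory` records, the scripted policy, and the toy list-of-lists image are all hypothetical stand-ins, and the search tool is a stub that makes no network calls. In a real agent the policy would be the trained model.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    tool: str          # name of the tool that was called
    modality: str      # "visual" or "text"
    observation: object


@dataclass
class Trajectory:
    question: str
    steps: list = field(default_factory=list)


def crop_tool(view, box):
    """Hypothetical visual tool: zoom into box=(top, left, bottom, right)."""
    top, left, bottom, right = box
    return "visual", [row[left:right] for row in view[top:bottom]]


def image_search_tool(view, _arg):
    """Hypothetical search tool: a stub that only describes the queried region."""
    return "text", f"stub result for a {len(view)}x{len(view[0])} region"


def run_agent(question, image, policy, tools, max_steps=8):
    """Run the policy until it answers, recording evidence of both modalities."""
    traj = Trajectory(question)
    view = image  # the current visual focus; updated by visual tools
    for _ in range(max_steps):
        action = policy(question, view, traj)
        if action["tool"] == "answer":
            return action["arg"], traj
        modality, obs = tools[action["tool"]](view, action["arg"])
        if modality == "visual":
            view = obs  # vision is not static: later steps see the new view
        traj.steps.append(Step(action["tool"], modality, obs))
    return None, traj  # step budget exhausted without an answer


def scripted_policy(question, view, traj):
    """Stand-in for the model: zoom in, search with the crop, then answer."""
    script = [
        {"tool": "crop", "arg": (0, 1, 2, 3)},
        {"tool": "image_search", "arg": None},
        {"tool": "answer", "arg": "stub answer"},
    ]
    return script[len(traj.steps)]


# Toy 3x4 "image" of pixel values, used only for illustration.
image = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
tools = {"crop": crop_tool, "image_search": image_search_tool}
answer, traj = run_agent("What is shown?", image, scripted_policy, tools)
```

The point of the sketch is the `view` variable: each visual tool call changes what the next step looks at, so the recorded trajectory mixes visual and textual evidence instead of text alone.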
Extensive experiments show that Visual-Seeker achieves state-of-the-art performance across five challenging multimodal search benchmarks, even surpassing several proprietary models, which validates its robust visual-native reasoning and search in real-world web environments.
The code and data are available in the Visual-Seeker GitHub repository.
Blogger's Review: Visual-Seeker brings a fresh approach to multimodal search. By reasoning actively over fine-grained visual details instead of treating the image as a static input, it improves performance in complex, open-world scenarios. Its data pipeline and 5K-trajectory training set give future research a solid foundation and deserve attention.