[CS.AI] Visual-Seeker: Revolutionizing Visual-Native Multimodal Search via Active Visual Reasoning

Published at: 2026-06-16 22:00
Last updated: 2026-06-17 01:38
#AI #Machine Learning #Open Source

Abstract

Multimodal large language models (MLLMs) have demonstrated impressive capabilities in many visual tasks, but they often struggle with factual grounding when confronted with complex, open-world scenarios. Recent multimodal deep search agents attempt to address this issue by utilizing external tools, yet the visual-native search paradigm remains underexplored. Existing methods primarily rely on simple images with explicit semantics and text-only evidence trajectories, limiting the agent's ability to perform multi-hop, cross-modal reasoning and search.

To address these limitations, we propose Visual-Seeker, a visual-native multimodal deep search agent built on active visual reasoning. Rather than treating vision as a static input, the agent actively attends to fine-grained visual details and dynamically gathers visual evidence throughout the search process. To unlock this visual-native potential, we design an active visual reasoning data pipeline and use it to synthesize 5K high-quality multimodal trajectories for model training.
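To make the idea of gathering evidence during search more concrete, here is a minimal sketch of an agent loop that interleaves visual actions with an answer step. It is an illustration only, not the Visual-Seeker implementation: the names `Evidence`, `Action`, `run_search_loop`, the tool names `crop` and `image_search`, and the scripted policy are all hypothetical, and the "tools" are toy stand-ins that return strings instead of calling a real model or search engine.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Evidence:
    modality: str  # "visual" or "text"
    content: str   # toy stand-in for an image crop or a retrieved snippet


@dataclass
class Action:
    name: str      # a tool name, or "answer" to stop the loop
    argument: str  # tool input, or the final answer text


# Hypothetical interfaces: a policy picks the next action from the question
# and the evidence gathered so far; a tool turns an argument into evidence.
Policy = Callable[[str, List[Evidence]], Action]
Tool = Callable[[str], Evidence]


def run_search_loop(
    question: str,
    policy: Policy,
    tools: Dict[str, Tool],
    max_steps: int = 6,
) -> Tuple[Optional[str], List[Evidence]]:
    """Alternate between choosing an action and collecting evidence."""
    evidence: List[Evidence] = []
    for _ in range(max_steps):
        action = policy(question, evidence)
        if action.name == "answer":
            return action.argument, evidence
        tool = tools.get(action.name)
        if tool is None:
            # Record the failure so the policy can react on the next step.
            evidence.append(Evidence("text", f"unknown tool: {action.name}"))
            continue
        evidence.append(tool(action.argument))
    return None, evidence  # step budget exhausted without an answer


def scripted_policy(question: str, evidence: List[Evidence]) -> Action:
    """Toy policy: zoom into a region, search with the crop, then answer."""
    if not evidence:
        return Action("crop", "logo in upper-left corner")
    if len(evidence) == 1:
        return Action("image_search", evidence[0].content)
    return Action("answer", evidence[-1].content)


toy_tools: Dict[str, Tool] = {
    "crop": lambda region: Evidence("visual", f"crop of {region}"),
    "image_search": lambda query: Evidence("visual", f"match for [{query}]"),
}

answer, trail = run_search_loop(
    "Which company made this device?", scripted_policy, toy_tools
)
print(answer)
for item in trail:
    print(item.modality, "|", item.content)
```

In this toy run the evidence trail contains visual items rather than text alone, which is the contrast the abstract draws with text-only evidence trajectories. In a real agent the policy would be the trained model and the tools would call actual cropping and web search services.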

In extensive experiments, Visual-Seeker achieves state-of-the-art performance across five challenging multimodal search benchmarks, even surpassing several proprietary models. These results validate its robust visual-native reasoning and search in real-world web environments.

The code and data can be accessed at: Visual-Seeker GitHub

Blogger's Review: Visual-Seeker brings a fresh approach to multimodal search. By making visual reasoning an active part of the search process, it reports stronger performance in complex, open-world scenarios. Its data pipeline and 5K-trajectory training set look like a solid foundation for future research and deserve attention.

Original Source: https://arxiv.org/abs/2606.15231
