[CS.AI] Visual Misleading vs Consistency: Unpacking Spatial Attention and Reliability in VLMs

Abstract

Multimodal Foundation Models are increasingly used as reasoning agents, which makes reliability (knowing when a model may hallucinate) critical. We challenge a common intuition, which we term the Attention-Confidence Assumption: that reliability follows from "structural" visual perception, so that tight attention on relevant regions signals a trustworthy answer while scattered attention indicates confusion.

Through the VLM Reliability Probe (VRP), we conduct a systematic cross-family study of reliability signals in contemporary Vision-Language Models (VLMs). We introduce two structural-attention metrics, cluster counts (C_k) and spatial entropy (H_s), to quantify the visual encoder's gaze, and we track the evolution of that entropy (ΔH_s) across layers.
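To make the two metrics concrete, here is a minimal sketch of how they could be computed. The abstract does not give the paper's exact definitions, so everything here is an assumption: spatial entropy is taken to be the Shannon entropy (in bits) of an attention map normalized to sum to 1, the cluster count is taken to be the number of 4-connected regions above a threshold, and the function names, toy maps, and threshold are hypothetical.

```python
import math


def spatial_entropy(attn):
    """Shannon entropy (bits) of a 2D attention map normalized to sum to 1.

    Assumed definition; the paper's H_s may be computed differently.
    """
    flat = [w for row in attn for w in row]
    total = sum(flat)
    if total <= 0:
        raise ValueError("attention map must have positive total mass")
    return -sum((w / total) * math.log2(w / total) for w in flat if w > 0)


def cluster_count(attn, threshold):
    """Count 4-connected regions of cells whose weight exceeds `threshold`.

    Assumed stand-in for C_k; the threshold is a hypothetical parameter.
    """
    rows, cols = len(attn), len(attn[0])
    seen = set()
    count = 0
    for r in range(rows):
        for c in range(cols):
            if attn[r][c] <= threshold or (r, c) in seen:
                continue
            count += 1  # found an unvisited cell above threshold: new cluster
            stack = [(r, c)]
            seen.add((r, c))
            while stack:  # flood-fill the rest of this cluster
                y, x = stack.pop()
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and (ny, nx) not in seen
                            and attn[ny][nx] > threshold):
                        seen.add((ny, nx))
                        stack.append((ny, nx))
    return count


# Toy 4x4 maps: an early layer focused on two adjacent patches,
# and a late layer with attention spread uniformly.
focused = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
diffuse = [[1 / 16] * 4 for _ in range(4)]

# Change in entropy from the early map to the late map.
delta_h = spatial_entropy(diffuse) - spatial_entropy(focused)
```

Under these assumptions, a positive change in entropy between an early and a late layer corresponds to the "lock early, diffuse later" pattern: attention starts concentrated and spreads out as depth increases.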

This reveals a phenomenon we call "Symbolic Detachment": models often "Early Lock" onto visual features only to diffuse attention later, severing early perception from final generation. Contrary to the grounding hypothesis, we find a "Cluster Failure": spatial attention has near-zero correlation (R ≈ 0.001) with accuracy. Instead, reliability is a phenomenon of generation dynamics and internal-state distributions. Self-Consistency, the agreement rate across sampled reasoning paths, is the dominant predictor of truth (R = 0.429).
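Self-consistency as described here, the agreement rate across sampled reasoning paths, can be sketched in a few lines. This is an illustrative assumption rather than the paper's implementation: the score is taken to be the share of sampled final answers that match the majority answer, answers are compared after trimming whitespace and lowercasing, and the function name and sample answers are hypothetical.

```python
from collections import Counter


def self_consistency(answers):
    """Return (majority_answer, agreement_rate) for sampled final answers.

    Assumed definition: the fraction of samples that agree with the most
    common answer after simple normalization.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(a.strip().lower() for a in answers)
    top_answer, top_count = counts.most_common(1)[0]
    return top_answer, top_count / len(answers)


# Hypothetical final answers from five sampled reasoning paths
# for the same image and question.
samples = ["cat", "Cat", "dog", "cat ", "cat"]
answer, score = self_consistency(samples)
```

With these five samples, four normalize to "cat", so the agreement rate is 0.8. A low rate would flag the answer as less trustworthy without looking at any attention map.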

Scaling causal interventions exposes a sharp architectural divergence: LLaVA locks its prediction in a fragile late-stage bottleneck, whereas PaliGemma and Qwen2-VL distribute reliability globally, remaining resilient even when ~50% or more of their most predictive layer is destroyed. For current VLMs, reliability signals are detached from visual grounding maps and are best inferred from generation-time dynamics and hidden-state probes.

Blogger's Review: This paper challenges the conventional link between visual attention and model reliability, identifying self-consistency as a far stronger predictor of correctness. The finding may influence how Vision-Language Models are designed and evaluated, shifting emphasis toward dynamic behavior during generation. By examining both model structure and generative mechanisms, the study offers a new perspective on reliability.
