Fine-grained action recognition in egocentric video poses significant challenges for Vision-Language Models (VLMs): actions often differ only in subtle visual cues, which leads individual models to over-rely on specific cues. We introduce Divide, Deliberate, Decide, a fully local, zero-shot multi-agent framework that operates as follows:
- Video Chunking: A VLM orchestrator segments the video and proposes a top-k candidate label list for each segment;
- Heterogeneous Specialist Engagement: An ensemble of heterogeneous VLM specialists from various open model families takes part in a structured deliberation that includes a round of peer-consultation questions;
- Ranking Aggregation: Agent rankings are aggregated using a Borda count, and the orchestrator re-ranks its own predictions based on the specialists' evidence.
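
The ranking aggregation step can be illustrated with a minimal sketch of a Borda count over top-k label lists. This is not the authors' implementation: the function name `borda_aggregate`, the scoring rule (k points for first place down to 1 for last, 0 for labels an agent did not rank), the alphabetical tie-break, and the example labels are all assumptions made for illustration.

```python
from collections import defaultdict


def borda_aggregate(rankings, k=None):
    """Aggregate ranked label lists with a (truncated) Borda count.

    rankings: list of lists, each ordered from most to least likely.
    k: list length used for scoring; defaults to the longest ranking.
    Assumed scoring: position 0 earns k points, position 1 earns k - 1,
    and so on; labels an agent did not rank earn 0 from that agent.
    """
    if k is None:
        k = max(len(r) for r in rankings)
    scores = defaultdict(int)
    for ranking in rankings:
        for position, label in enumerate(ranking[:k]):
            scores[label] += k - position
    # Highest total first; ties broken alphabetically (an arbitrary choice).
    return sorted(scores, key=lambda label: (-scores[label], label))


# Hypothetical rankings from three specialists for one video segment.
specialist_rankings = [
    ["open drawer", "close drawer", "pull handle"],
    ["close drawer", "open drawer", "pull handle"],
    ["open drawer", "pull handle", "close drawer"],
]
consensus = borda_aggregate(specialist_rankings)
print(consensus)
```

A positional rule like this rewards labels that many agents rank consistently high, so one specialist's confident outlier carries less weight than broad agreement. How the orchestrator then uses the specialists' evidence to re-rank its own predictions is not specified here and is left out of the sketch.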
The entire pipeline runs locally and requires no fine-tuning. Experiments show that the method significantly improves zero-shot action recognition over the baseline and highlight the impact of the heterogeneous deliberation step. The gains arise from decorrelated model priors rather than from additional compute.
Blogger's Review: This multi-agent framework cleverly addresses the bias issues in fine-grained action recognition through the collaboration of heterogeneous models, showcasing the potential for improving model performance without relying on extensive labeled data. This approach has broad implications for future action recognition tasks.