We present TickingCollabBench, a Minecraft-based multi-agent benchmark for a novel class of time-sensitive complementary collaboration tasks. Our benchmark reflects four core characteristics of real-world collaboration: agent heterogeneity, mandatory collaboration, dynamic environments, and strict real-time constraints with failure risks.
To build the benchmark, we develop the TickingCollab framework, which generates diverse dynamic environments and abstracts Minecraft's primitive APIs so that tasks can be specified declaratively in YAML by composing environment events.
Building on this, we design a feasibility-aware automated benchmark generation pipeline, where an LLM drafts structurally diverse task configurations and a feasibility verifier filters out invalid ones using approximate constraints.
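The generate-then-filter idea can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration rather than the TickingCollab implementation: the `TaskDraft` fields, the function names, the walking speed of 4.3 blocks per second, and the choice of a straight-line travel-time bound as the approximate constraint. The drafts are hard-coded here, standing in for configurations an LLM would propose.

```python
import math
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

TICKS_PER_SECOND = 20  # Minecraft's nominal game tick rate


@dataclass(frozen=True)
class TaskDraft:
    """Hypothetical, simplified task configuration (not the real schema)."""
    name: str
    agent_start: Tuple[int, int]  # (x, z) block coordinates of the agent
    target: Tuple[int, int]       # (x, z) block the agent must reach
    deadline_ticks: int           # time budget before the task fails


def min_travel_ticks(draft: TaskDraft, blocks_per_second: float = 4.3) -> int:
    """Optimistic lower bound on travel time, ignoring obstacles and terrain.

    The speed is an assumed parameter, not a value taken from the paper.
    """
    dx = draft.target[0] - draft.agent_start[0]
    dz = draft.target[1] - draft.agent_start[1]
    distance = math.hypot(dx, dz)
    return math.ceil(distance / blocks_per_second * TICKS_PER_SECOND)


def is_feasible(draft: TaskDraft) -> bool:
    """Approximate check: reject a draft only if even the optimistic
    travel-time bound exceeds the deadline."""
    return draft.deadline_ticks >= min_travel_ticks(draft)


def filter_drafts(
    drafts: Iterable[TaskDraft],
    verifier: Callable[[TaskDraft], bool] = is_feasible,
) -> List[TaskDraft]:
    """Keep only the drafts the verifier accepts."""
    return [d for d in drafts if verifier(d)]


if __name__ == "__main__":
    # Hard-coded stand-ins for LLM-proposed configurations.
    drafts = [
        TaskDraft("reach_beacon", (0, 0), (30, 40), deadline_ticks=400),
        TaskDraft("reach_beacon_rushed", (0, 0), (30, 40), deadline_ticks=100),
    ]
    for kept in filter_drafts(drafts):
        print(kept.name)
```

Because the bound is optimistic, this check can still admit a task that turns out to be infeasible once obstacles are accounted for, but under the assumed speed it does not discard a task that is achievable. That asymmetry is one plausible reading of "approximate constraints"; the actual verifier may use different or additional checks.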
Evaluations show that LLM agents frequently fail in dynamic environments and fall significantly short of a global-knowledge oracle, owing to language latency and the inherent difficulty of coordinating under partial observability and agent heterogeneity.
Blogger's Review: This study introduces a multi-agent collaboration benchmark built in Minecraft and highlights how time sensitivity and dynamic environments affect collaboration performance. It matters for understanding and improving agent performance in complex scenarios, particularly the challenges agents may face in real-world applications.