[CS.AI] Revolutionary Benchmark: Unveiling Latent Failures in LLM Planning

Large Language Models (LLMs) are increasingly used as planners for autonomous agents in household settings. Current benchmarks focus on evaluating whether LLM-generated plans can execute successfully but overlook a critical type of failure: latent failures. Unlike immediate failures that provide instant feedback during execution and allow timely corrections, latent failures do not halt plan execution immediately but silently undermine goal achievement, potentially causing irreversible harm.

To address this gap, we introduce SIMMER, a benchmark for assessing latent failures in LLM planning through a human-curated symbolic world model grounded in the kitchen domain. SIMMER defines a world model comprising 77 actions, 262 unique objects, and approximately 46,800 semantically realistic interactions derived from real-world cooking scripts. It employs a state machine executor to validate plans against the world model and detect immediate precondition violations, latent hazards, and irreversible failures.
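To make the three failure categories concrete, here is a minimal sketch of a state machine executor over a toy kitchen model. The action names, facts, and the `execute` function are all hypothetical illustrations and are not taken from SIMMER's actual world model or code. It also assumes that an irreversible failure is a latent failure whose effect no action in the model can undo.

```python
from dataclasses import dataclass
from typing import NamedTuple


@dataclass(frozen=True)
class Action:
    preconditions: frozenset = frozenset()  # facts required before the action
    adds: frozenset = frozenset()           # facts the action makes true
    removes: frozenset = frozenset()        # facts the action makes false
    # Conditional side effects: (trigger fact, hazard fact added silently).
    side_effects: tuple = ()


class Finding(NamedTuple):
    step: int
    kind: str           # "immediate" or "latent"
    detail: str
    irreversible: bool


# Hypothetical toy actions (assumption, not SIMMER's 77-action model).
ACTIONS = {
    "chop_chicken": Action(adds=frozenset({"chicken_chopped", "board_dirty"})),
    "wash_board": Action(removes=frozenset({"board_dirty"})),
    "chop_lettuce": Action(
        adds=frozenset({"lettuce_chopped"}),
        side_effects=(("board_dirty", "lettuce_contaminated"),),
    ),
    "serve_salad": Action(
        preconditions=frozenset({"lettuce_chopped"}),
        adds=frozenset({"salad_served"}),
    ),
}

# Hazard facts that no action in this toy model can remove.
IRREVERSIBLE = {"lettuce_contaminated"}


def execute(plan):
    """Run a plan step by step and return (findings, final state)."""
    state = set()
    findings = []
    for step, name in enumerate(plan):
        action = ACTIONS[name]
        missing = action.preconditions - state
        if missing:
            # Immediate failure: execution halts and the planner gets feedback.
            findings.append(Finding(step, "immediate", name, False))
            break
        # Latent failure: the step succeeds, but a hazard is silently recorded.
        hazards = {h for trigger, h in action.side_effects if trigger in state}
        state = (state - action.removes) | action.adds | hazards
        for hazard in sorted(hazards):
            findings.append(Finding(step, "latent", hazard, hazard in IRREVERSIBLE))
    return findings, state


findings, state = execute(["chop_chicken", "chop_lettuce", "serve_salad"])
print(findings)
print("salad_served" in state)
```

In this sketch the plan runs to completion and the salad is served, yet the executor reports a latent, irreversible contamination at the second step. Inserting `wash_board` before `chop_lettuce` yields no findings.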

Experiments across six LLMs demonstrate that even frontier models achieve at most 17% error-free plans. Furthermore, up to 56% of plans contain latent failures, most of which lead to irreversible consequences. We also show that explicit state reasoning via counterfactual foresight simulation can reduce latent failures by up to 72% and irreversible cases by up to 75%, indicating a promising direction for more robust LLM planners.
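The summary above does not say how counterfactual foresight simulation is implemented, so the following is only one possible reading, sketched under stated assumptions. Before committing to each step, the planner asks what the state would look like if the step were taken now. If a hazard would appear, it tries a repair action first. The transition function `step`, the fact names, and the `REPAIRS` list are hypothetical, and a hand-written rule stands in for whatever reasoning the LLM itself would perform.

```python
# Hazard facts in this toy model (assumption, not from SIMMER).
HAZARD_FACTS = {"lettuce_contaminated"}

# Hypothetical repair actions the planner may insert.
REPAIRS = ["wash_board"]


def step(state, action):
    """Pure transition function: returns a new state, never mutates the input."""
    new = set(state)
    if action == "chop_chicken":
        new.add("board_dirty")
    elif action == "wash_board":
        new.discard("board_dirty")
    elif action == "chop_lettuce":
        new.add("lettuce_chopped")
        if "board_dirty" in state:
            new.add("lettuce_contaminated")  # silent side effect
    return new


def new_hazards(state, action):
    """Counterfactual query: which hazards would appear if `action` ran now?"""
    return (step(state, action) & HAZARD_FACTS) - state


def revise(plan):
    """Return a plan with repairs inserted where foresight predicts a hazard."""
    state, revised = set(), []
    for action in plan:
        if new_hazards(state, action):
            for repair in REPAIRS:
                candidate = step(state, repair)
                # Keep the repair only if the simulated outcome is hazard-free.
                if not new_hazards(candidate, action):
                    state = candidate
                    revised.append(repair)
                    break
        state = step(state, action)
        revised.append(action)
    return revised


print(revise(["chop_chicken", "chop_lettuce"]))
# ['chop_chicken', 'wash_board', 'chop_lettuce']
```

The key design choice in this sketch is that `step` is side-effect free, so hypothetical outcomes can be explored without touching the real state.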

Blogger's Review: SIMMER fills a real gap in how LLM planners are evaluated, since existing benchmarks do not measure latent failures, and it offers useful direction for future research. The reported gains from counterfactual foresight simulation suggest that explicit state reasoning could make LLM planners considerably more reliable as decision-makers for autonomous agents.
