[CS.AI] ToolMenuBench: Benchmarking Tool-Menu Filtering for LLM Agents

Abstract

Tool-augmented large language model agents increasingly operate over large tool libraries, but existing evaluations often focus on whether a model can call a tool correctly rather than how the visible tool menu shapes reliability, efficiency, and safety-relevant risk exposure.

We introduce ToolMenuBench, a benchmark for evaluating tool-menu construction in multi-step LLM agents. ToolMenuBench varies tool-menu size, distractor type, state-dependent task structure, and risk exposure, and reports both filter-level and downstream agent metrics, including visible-tool count, risky-tool exposure, task success, wrong-tool calls, premature actions, and token usage.
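To make the filter-level metrics concrete, here is a minimal Python sketch of how visible-tool count and risky-tool exposure might be computed for a single step's menu. It is an illustration under stated assumptions, not the benchmark's actual code: the `Tool` class, the `filter_metrics` function, the `needed_tool_visible` field, and the example tool names are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    """Hypothetical tool record; the `risky` flag is an assumed annotation."""
    name: str
    risky: bool = False


def filter_metrics(visible, needed):
    """Compute filter-level metrics for the menu shown at one agent step.

    `visible` is the list of tools the agent can see; `needed` is the name
    of the tool the task requires at this step (an assumed ground truth).
    """
    visible_names = {tool.name for tool in visible}
    return {
        "visible_tool_count": len(visible),
        "risky_tool_exposure": sum(tool.risky for tool in visible),
        # Extra sanity check, not one of the metrics named in the abstract.
        "needed_tool_visible": needed in visible_names,
    }


# Toy tool library with two tools marked as risky.
library = [
    Tool("search_orders"),
    Tool("get_order_status"),
    Tool("issue_refund", risky=True),
    Tool("delete_account", risky=True),
    Tool("send_email"),
]

# Unfiltered menu: every tool is visible.
all_tools = filter_metrics(library, needed="get_order_status")

# Filtered menu: only a hand-picked subset is visible.
keep = {"search_orders", "get_order_status"}
filtered = filter_metrics(
    [tool for tool in library if tool.name in keep],
    needed="get_order_status",
)

print(all_tools)
print(filtered)
```

In this toy case the filtered menu exposes fewer tools and no risky ones while still showing the tool the step needs. Downstream metrics such as task success, wrong-tool calls, and premature actions would require running an actual agent and are not sketched here.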

In a controlled evaluation across seven model backends, three tool-menu sizes, six filtering methods, and seven evaluation settings, causal minimal tool filtering (CMTF) improves task success from 32.1% under all-tools exposure to 85.7%, while reducing average token usage by roughly 98%. CMTF achieves the strongest overall tradeoff, reducing visible tools, wrong-tool calls, premature actions, and risky-tool exposure relative to unfiltered exposure, lexical filtering, state-aware filtering, and broader causal-path baselines.

ToolMenuBench provides a reusable evaluation framework for studying the agent-interface problem: which tools should be visible, when they should be visible, and under what cost or risk constraints.

Blogger's Review: ToolMenuBench offers a systematic framework for evaluating tool selection and filtering in LLM agents. Its results show that careful menu filtering can substantially raise task success while cutting token usage, which highlights how much tool-menu design matters for agent efficiency. The benchmark should help future LLM agents execute tasks more reliably and safely.
