In production, LLM assistants route user requests to an ever-growing library of specialized tools, but how does routing accuracy degrade as the catalog scales? We studied a catalog of 110 agents and 584 tools, evaluating three frontier models as the catalog grew from 10 to 110 agents.
As the catalog grows, the routing F1 score on under-specified requests drops by 16 to 23 percentage points (pp), depending on the model. An oracle analysis decomposes the degradation into a \emph{retrieval} gap (the model fails to surface the right tool) and a \emph{confusion} gap (even with perfect retrieval, the oracle ceiling drops by 10pp).
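The decomposition can be illustrated with a little arithmetic. The sketch below is one plausible reading, not the authors' definition: it assumes the confusion gap is the drop in oracle F1 between the small and the full catalog, and the retrieval gap is the distance between the oracle and the model at full scale. The function name `decompose_degradation` and all input values are hypothetical and chosen only for illustration; they are not results from the study.

```python
def decompose_degradation(model_small: float, oracle_small: float,
                          model_large: float, oracle_large: float) -> dict:
    """Split routing degradation into illustrative components.

    All arguments are F1 scores in percentage points. "small" and "large"
    refer to the smallest and largest catalog sizes. The definitions used
    here are assumptions made for this sketch.
    """
    return {
        # How much the model itself loses as the catalog grows.
        "total_drop": model_small - model_large,
        # How much the oracle ceiling falls even with perfect retrieval.
        "confusion_gap": oracle_small - oracle_large,
        # What perfect retrieval would still recover at full scale.
        "retrieval_gap_large": oracle_large - model_large,
    }


# Hypothetical inputs, NOT figures from the study.
gaps = decompose_degradation(model_small=80.0, oracle_small=90.0,
                             model_large=60.0, oracle_large=80.0)
print(gaps)
```

With these made-up inputs, the model loses 20pp in total, of which 10pp would remain even under perfect retrieval.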
Embedding-based shortlisting recovers 10 to 11pp of F1 at full scale across all three models and two providers. A production annotation study (1,435 human-labeled utterances, three annotators) confirms the recovery on real traffic, with gains of 10 to 17pp, although absolute performance there is 10 to 15pp lower.
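The idea behind embedding-based shortlisting can be sketched as follows: embed the request and every tool description, rank tools by cosine similarity, and pass only the top `k` candidates to the routing model. This is a minimal sketch under stated assumptions, not the study's implementation. The `embed` function is a toy bag-of-words stand-in for a real embedding model, and the catalog entries, tool names, and the value of `k` are all hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: bag-of-words token counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def shortlist(request: str, catalog: dict[str, str], k: int = 2) -> list[str]:
    """Return the k tool names whose descriptions best match the request."""
    query = embed(request)
    ranked = sorted(catalog,
                    key=lambda name: cosine(query, embed(catalog[name])),
                    reverse=True)
    return ranked[:k]


# Hypothetical four-tool catalog; the study's catalog has 584 tools.
CATALOG = {
    "calendar_create_event": "create a calendar event or meeting at a given time",
    "email_send": "send an email message to a recipient",
    "expense_submit": "submit an expense report with a receipt",
    "weather_lookup": "look up the weather forecast for a city",
}

candidates = shortlist("send an email to my manager", CATALOG, k=2)
print(candidates)
```

The routing model then chooses among `candidates` instead of the full catalog, so it sees fewer distractors. In practice the toy `embed` would be replaced by a learned sentence-embedding model, and tool embeddings would be precomputed rather than recomputed for every request.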
Blogger's Review: This article examines why routing accuracy declines as the number of agents grows and presents an effective recovery strategy. Backed by a multi-model evaluation and validation on production traffic, the embedding-based approach delivers a clear performance gain and offers useful insights for enterprise applications.