[CS.AI] Adversarial Concept Search: Predicting Compositio...

This paper investigates how the representational geometry of Large Language Models (LLMs) can be used to predict which combinations of concepts a model is likely to fail on. Traditionally, developers have found such cases by hand-designing difficult problems or building extensive benchmarks to capture challenging edge cases. Instead, we propose predicting failure scenarios from feature interference.

The study reveals that in tasks requiring systematic composition (such as toy programmatic settings, multi-hop reasoning, and multilingual factual recall), the model reliably composes a pair of concepts when they are encoded near-orthogonally. Conversely, when their linear encodings are close, the features interfere and composition fails.
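To make the geometric intuition concrete, here is a minimal sketch that scores concept pairs by how far they are from orthogonal. It is not the paper's method: the absolute cosine similarity as an overlap score, the function names, and the toy three-dimensional vectors are all assumptions made for illustration. In practice the directions would come from some linear probe or feature-extraction step that this summary does not specify.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_pairs_by_overlap(directions):
    """Return (overlap, concept_a, concept_b) tuples, highest overlap first.

    Overlap is |cos|: near 0 means near-orthogonal encodings, near 1 means
    the two directions almost coincide (assumed here to signal interference).
    """
    scored = [
        (abs(cosine(directions[a], directions[b])), a, b)
        for a, b in combinations(sorted(directions), 2)
    ]
    return sorted(scored, reverse=True)


# Hypothetical concept directions (toy values, not taken from any model).
directions = {
    "color": [1.0, 0.0, 0.0],
    "shape": [0.0, 1.0, 0.0],
    "hue": [0.9, 0.1, 0.0],
}

ranked = rank_pairs_by_overlap(directions)
for overlap, a, b in ranked:
    print(f"{a} + {b}: overlap = {overlap:.3f}")
```

Under this assumed score, the "color"/"hue" pair ranks first as the most likely to interfere, while the orthogonal "color"/"shape" pair ranks last. Note that no model inputs are evaluated at any point; the ranking uses only the encoded directions.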

Our method reliably anticipates failure modes across various compositional tasks without evaluating specific inputs. These findings lay the groundwork for using representational geometry to identify high-risk examples, construct targeted stress tests, and provide a scalable foundation for active learning in real-world deployments.

Blogger's Review: This research offers a novel perspective on understanding the failure modes of LLMs, highlighting how the relationships between features affect model performance. By identifying potential failure combinations in advance, developers can improve model robustness in complex tasks more effectively. Such research has significant implications for real-world applications, especially in high-stakes environments.
