Abstract
LLM-based agents are becoming increasingly capable and widely deployed, creating growing incentives for adversarial misuse in the real world. A key emerging threat is the decomposition attack, in which a harmful task is broken into simpler, benign-looking subtasks that evade safety mechanisms when executed separately but cumulatively fulfill the malicious intent.
Although recent benchmarks assess agent safety in multi-turn and multi-tool-use settings, they do not explicitly capture this form of decompositional misuse and may not represent realistic adversarial execution flows. To address this gap, we introduce DeCompBench, a benchmark designed specifically to evaluate agent safety under decomposition attacks.
DeCompBench is created with a decomposition-by-design principle using a graphical framework, enabling harmful task decomposition into individually benign and executable subtasks with realistic workflows. Our experiments using a custom decomposer show that state-of-the-art agents exhibit high refusal rates on monolithic harmful tasks but significantly lower refusal rates on their decomposed variants, often inadvertently fulfilling adversarial objectives.
These findings underscore the need for safety evaluations against decomposition attacks and for corresponding defenses. Our dataset is publicly available on Hugging Face.
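The comparison reported above, refusal rate on monolithic tasks versus refusal rate on their decomposed variants, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `TaskResult` record, the rule that a decomposed task counts as refused if any one of its subtasks is refused, and the toy data. None of it is taken from DeCompBench or its evaluation code.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class TaskResult:
    """Hypothetical record of one benchmark task (not the DeCompBench schema)."""
    task_id: str
    monolithic_refused: bool      # did the agent refuse the task posed as a whole?
    subtask_refused: List[bool]   # one refusal flag per decomposed subtask


def refusal_rate(flags: Iterable[bool]) -> float:
    """Fraction of True flags; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def decomposed_refused(result: TaskResult) -> bool:
    # Assumption: refusing any single subtask blocks the overall objective.
    return any(result.subtask_refused)


def refusal_gap(results: List[TaskResult]) -> Tuple[float, float, float]:
    """Return (monolithic rate, decomposed rate, difference)."""
    mono = refusal_rate(r.monolithic_refused for r in results)
    decomp = refusal_rate(decomposed_refused(r) for r in results)
    return mono, decomp, mono - decomp


# Toy data, invented for this example only.
results = [
    TaskResult("t1", True, [False, False, False]),
    TaskResult("t2", True, [False, True]),
    TaskResult("t3", False, [False, False]),
    TaskResult("t4", True, [False, False, False, False]),
]

mono, decomp, gap = refusal_gap(results)
print(f"monolithic={mono:.2f} decomposed={decomp:.2f} gap={gap:.2f}")
```

A large positive gap on real data would correspond to the pattern the abstract describes: agents that refuse a harmful task stated outright but complete it when it arrives as a series of individually benign steps.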
Blogger's Review: The risk of decomposition attacks is severely underestimated in current applications of large language models. DeCompBench not only highlights the limitations of existing safety evaluations but also offers new insights for future security strategies. It's a must-read for all developers and researchers!