[CS.AI] AgentCyberRange: Benchmarking AI in Realistic Cyber Ranges

Frontier AI systems are increasingly demonstrating capabilities in cybersecurity tasks such as codebase inspection, vulnerability detection, and exploitation. However, evaluating their offensive capabilities is constrained by limited access to open, reproducible multi-host cyber ranges. Existing public benchmarks often capture isolated skills like CTF solving, vulnerability reproduction, and exploit generation, but abstract away realistic intrusion workflows: discovering exposed services, gaining a foothold, collecting internal information, and expanding compromise across hosts. This gap makes it challenging to observe emerging risks early, as frontier AI systems are rarely evaluated under realistic attack conditions.

To address this, we introduce AgentCyberRange, the first open, multi-range infrastructure for measuring autonomous cyber attack capability in realistic cyber ranges. This benchmark combines 110 vulnerabilities across 15 real web applications and 8 enterprise-like cyber ranges with 156 internal hosts, along with Cage, a toolchain for execution, orchestration, result collection, and verification.

The benchmark covers two core stages: web exploitation, where agents explore exposed applications and validate vulnerabilities, and post-exploitation, where agents turn an initial foothold into broader internal compromise. We evaluate six frontier AI systems under matched prompts and budgets. GPT-5.5 with Codex performs best, solving 16.1% of web exploitation tasks and 31.7% of post-exploitation tasks; these rates rise to 33.0% and 46.3% when the agent is given more concrete hints. We also observe out-of-benchmark findings, including previously unknown vulnerabilities in popular projects and payload mutation that bypasses host defenses. These results indicate that open cyber-range evaluation is essential for observing emerging offensive capabilities under realistic and reproducible conditions.
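To put the effect of hints in perspective, the quoted solve rates can be turned into an absolute gain (percentage points) and a relative multiplier. The short Python sketch below does only that arithmetic; the names `RATES`, `hint_uplift`, and `UPLIFT` are illustrative and not part of the benchmark or its Cage toolchain, and the only inputs are the four percentages stated above.

```python
# Solve rates (percent) for the best-performing system, as quoted in the summary.
# The dictionary layout and names are illustrative assumptions, not benchmark output.
RATES = {
    "web exploitation": {"base": 16.1, "hinted": 33.0},
    "post-exploitation": {"base": 31.7, "hinted": 46.3},
}


def hint_uplift(base, hinted):
    """Return (absolute gain in percentage points, hinted rate / base rate)."""
    return round(hinted - base, 1), round(hinted / base, 2)


UPLIFT = {stage: hint_uplift(r["base"], r["hinted"]) for stage, r in RATES.items()}

for stage, (points, factor) in UPLIFT.items():
    print(f"{stage}: +{points} points, {factor}x the unhinted rate")
```

Read this way, hints roughly double the web exploitation rate, while the post-exploitation rate grows by a bit less than half, even though the absolute gains in percentage points are similar.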

Blogger's Review: AgentCyberRange moves the evaluation of frontier AI systems in cybersecurity away from isolated skills and toward realistic, multi-host environments. An open, reproducible range like this could accelerate progress in cybersecurity technology, and it makes a strong case for continued research into how capable AI systems are at carrying out cyber attacks.
