The computer systems community has seen growing interest in AI-driven system evolution, where AI agents iteratively rewrite systems. Frameworks like AdaEvolve and Engram report score improvements of 12-60% over human-designed algorithms. However, there are practical concerns regarding the performance of AI-evolved programs on unseen workloads and potential scalability regressions. To address these issues, we need automated mechanisms to uncover hidden weaknesses in AI-evolved systems.
To this end, we developed AIChilles, which takes as input a baseline program $P$ and an AI-evolved program $P'$. AIChilles searches for valid workloads where $P'$ regresses relative to $P$ in correctness, runtime, memory usage, or output quality. To tackle the diversity in system applications, weakness types, and potential bugs, AIChilles combines deterministic workload-parameter extraction, agent-based constraint inference, differential oracles, and code-frequency coverage to discover diverse failures.
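To make the idea of a differential oracle concrete, here is a minimal sketch. It is not the AIChilles implementation: the names `differential_oracle`, `search`, `baseline_sort`, and `evolved_sort`, the thresholds, and the random workload generator are all hypothetical, chosen only for illustration. The sketch compares $P$ and $P'$ on the same workload and flags a correctness or runtime regression; memory usage and output quality could be checked in the same pattern.

```python
import random
import time
from dataclasses import dataclass
from typing import Callable, List, Optional

Program = Callable[[List[int]], List[int]]


@dataclass
class Regression:
    workload: List[int]
    kind: str      # "correctness" or "runtime"
    detail: str


def timed(program: Program, workload: List[int]):
    """Run program on a copy of the workload; return (output, seconds)."""
    start = time.perf_counter()
    output = program(list(workload))
    return output, time.perf_counter() - start


def differential_oracle(baseline: Program, evolved: Program,
                        workload: List[int], slowdown: float = 2.0,
                        min_seconds: float = 1e-3) -> Optional[Regression]:
    """Flag the workload if the evolved program regresses against the baseline."""
    out_base, t_base = timed(baseline, workload)
    out_new, t_new = timed(evolved, workload)
    # Correctness: the baseline's output is treated as ground truth.
    if out_new != out_base:
        return Regression(workload, "correctness",
                          f"expected {out_base}, got {out_new}")
    # Runtime: min_seconds keeps timer noise on tiny inputs from triggering.
    if t_new > slowdown * max(t_base, min_seconds):
        return Regression(workload, "runtime",
                          f"{t_new:.4f}s vs {t_base:.4f}s")
    return None


def search(baseline: Program, evolved: Program, trials: int = 200,
           seed: int = 0, slowdown: float = 2.0) -> List[Regression]:
    """Random search over workload parameters (list length and contents)."""
    rng = random.Random(seed)
    findings = []
    for _ in range(trials):
        n = rng.randint(0, 50)                       # assumed size range
        workload = [rng.randint(0, 20) for _ in range(n)]
        finding = differential_oracle(baseline, evolved, workload, slowdown)
        if finding is not None:
            findings.append(finding)
    return findings


def baseline_sort(xs: List[int]) -> List[int]:
    return sorted(xs)


def evolved_sort(xs: List[int]) -> List[int]:
    # Hypothetical "optimization" with a hidden bug: duplicates are dropped.
    return sorted(set(xs))


if __name__ == "__main__":
    found = search(baseline_sort, evolved_sort)
    print(f"{len(found)} regressing workloads found")
```

Random search stands in here for the constraint inference and coverage guidance described above; the sketch only illustrates why comparing against the baseline removes the need for a hand-written specification of correct behavior.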
Across five system applications and 30 AI-evolved programs, AIChilles found 49 distinct hidden weaknesses. We also demonstrate that explicitly including AIChilles in the AI-driven development lifecycle can mitigate several of these weaknesses.
Blogger's Review: AIChilles offers a practical safety check for AI-driven system evolution. By actively searching for workloads on which an AI-evolved program regresses relative to its baseline, it can surface hidden weaknesses and, when included in the development lifecycle, help mitigate several of them. If these results carry over beyond the five applications studied, the approach could shape how AI-evolved systems are validated in the future.