[CS.AI] Trust Dynamics in AI Agents: Formation, Breakage, and Recovery

Abstract

As language-model agents increasingly collaborate, trust between them becomes crucial, yet there is currently no standard way to measure it. We propose a behavioral measure based on costly verification.

In a cooperative survival game, checking a teammate's work consumes resources, while trusting a wrong answer can be fatal. A reduction in verification relative to a memoryless baseline of the same model therefore offers an observable measure of trust. Using this framework, we investigate how trust forms, breaks, and recovers across six frontier model snapshots.
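The trust measure described above can be made concrete as a relative drop in verification. The sketch below is a minimal illustration under stated assumptions: the `Episode` record, the per-episode counts of checkable answers and actual checks, and the `trust_discount` formula (one minus the ratio of verification rates) are hypothetical choices for this example, not the paper's published definition.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    # Hypothetical log entry: how many teammate answers the agent could
    # have checked, and how many it actually paid to check.
    opportunities: int
    verifications: int


def verification_rate(episodes: list[Episode]) -> float:
    """Fraction of checkable answers the agent chose to verify."""
    total = sum(e.opportunities for e in episodes)
    if total == 0:
        raise ValueError("no verification opportunities")
    return sum(e.verifications for e in episodes) / total


def trust_discount(with_memory: list[Episode], memoryless: list[Episode]) -> float:
    """Relative drop in verification versus the memoryless baseline.

    Assumed definition: 0.0 means no change from the baseline, and 0.75
    means the agent verified 75% less than the same model without memory
    of the teammate's track record.
    """
    baseline = verification_rate(memoryless)
    if baseline == 0:
        raise ValueError("baseline never verifies; discount is undefined")
    return 1.0 - verification_rate(with_memory) / baseline


# Illustrative numbers only, not data from the paper.
memoryless_runs = [Episode(10, 10), Episode(10, 10)]  # checks every answer
trusting_runs = [Episode(10, 4), Episode(10, 1)]      # checks 5 of 20 answers
print(trust_discount(trusting_runs, memoryless_runs))  # 0.75
```

Measuring against the same model without memory separates the effect of the teammate's track record from the model's general level of caution, so a model that is simply less cautious overall does not register as more trusting.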

When paired with a consistently reliable teammate, four snapshots (Claude Opus 4.6, Claude Sonnet 4.6, GPT-5.1, and Gemini 3.1 Pro) reduced verification by approximately 60-85%, while two smaller snapshots showed little to no adjustment. Teammate failures reverse this discount, but models respond differently: some focus renewed scrutiny on the culprit, while others become more cautious toward the entire team. Recovery is slower than formation, and clustered failures sustain suspicion longer than the same number of spread-out failures.

These differences have practical implications. Models that form trust verify less, decide faster, and achieve higher payoffs in our environment, whereas persistent over-verification correlates with indecision rather than safety. Our results suggest that trust dispositions can be measured before deployment, and that calibration, rather than maximal suspicion, should be central to governing multi-agent AI systems.

Blogger's Review: This paper turns costly verification into a behavioral measure of trust between AI agents and uses it to trace how that trust forms, breaks after failures, and recovers. Its most practical takeaway for multi-agent governance is that calibration matters more than blanket suspicion: models that kept over-verifying became more indecisive rather than safer.
