Abstract
Large language model (LLM) agents are increasingly used for tasks such as screening applicants, recommending credit, and triaging patients, yet their fairness is still assessed primarily by grading their answers rather than their actions. We introduce AgentFairBench, an inexpensive, reproducible, multi-domain benchmark for evaluating demographic disparity in the actions LLM agents take.
AgentFairBench is grounded in a companion framework, the Bias Conduction Framework (BCF), and covers three regulator-anchored domains: hiring, lending, and medical triage. Synthetic, demographically neutral profiles are assessed in counterfactual matched sets that vary only by a name-coded race × gender signal (following the Bertrand and Mullainathan tradition) under four scaffolds of increasing agency (direct, chain-of-thought, multi-agent deliberation, tool-augmented).
A NumPy-only harness computes counterfactual flip rate, mean absolute score difference (MASD), action-rate disparity, and tool-invocation disparity, with bootstrap confidence intervals, paired tests, and false-discovery-rate control, all for single-digit dollars per model. A live leaderboard with a held-out private split and a contamination canary accepts external model submissions.
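To make the metrics named above concrete, here is a minimal NumPy sketch of how two of them, plus the bootstrap interval and false-discovery-rate step, might be computed. It is not the released harness. The array layout (one row per matched set, one column per name variant), the function names, and the exact metric definitions are assumptions made for illustration, and the FDR step uses the standard Benjamini-Hochberg procedure, which the abstract does not name.

```python
import numpy as np

# Assumed layout (hypothetical): rows are matched sets, columns are the
# name-coded variants of the same underlying profile.

def flip_rate(decisions):
    """Fraction of matched sets whose decision is not identical across all variants."""
    d = np.asarray(decisions)
    return float(np.mean(d.max(axis=1) != d.min(axis=1)))

def masd(scores):
    """Mean absolute score difference over all variant pairs and all matched sets."""
    s = np.asarray(scores, dtype=float)
    i, j = np.triu_indices(s.shape[1], k=1)  # every unordered pair of variants
    return float(np.mean(np.abs(s[:, i] - s[:, j])))

def bootstrap_ci(data, stat, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, resampling whole matched sets with replacement."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = data.shape[0]
    idx = rng.integers(0, n, size=(n_boot, n))
    reps = np.array([stat(data[rows]) for rows in idx])
    lo, hi = np.quantile(reps, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

def benjamini_hochberg(pvals, q=0.05):
    """Boolean mask of hypotheses rejected at FDR level q (Benjamini-Hochberg)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    passed = p[order] <= q * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.nonzero(passed)[0].max()  # largest rank meeting its threshold
        reject[order[: k + 1]] = True
    return reject
```

Resampling whole matched sets, rather than individual decisions, keeps the counterfactual pairing intact inside each bootstrap replicate.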
Our pilot study (864 decisions plus a test-retest replication) yields a methodological lesson: comparing a six-group score spread against a two-run noise difference overstates disparity by approximately 2.4× due to statistical arity alone, because the spread across six values is expected to exceed the difference between two even when every value is drawn from the same noise distribution. Against an arity-matched noise floor and an omnibus group test, claude-haiku-4-5 shows no demographic effect above sampling noise (0 of 120 pairwise and 0 of 9 omnibus contrasts survive correction); a planted-bias test confirms that the instrument detects disparity when it is present. The contributions are a sound, sensitive, adoption-ready instrument, the arity-matched null methodology, and open artifacts to scale it. Code, data, and harness are released under open licenses, with an anonymized review artifact.
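The arity effect can be illustrated with a small simulation. The sketch below is an idealized illustration, not the paper's analysis: it assumes independent standard-normal noise and uses the range (max minus min) as the spread statistic, both of which are assumptions. Under those assumptions the expected spread of six pure-noise values is a little over twice the expected difference between two, which is the same order as the roughly 2.4× reported from the pilot data; the exact factor depends on the statistic and the noise distribution.

```python
import numpy as np

def expected_range(k, n_trials=200_000, seed=0):
    """Monte Carlo estimate of E[max - min] over k iid standard-normal draws."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_trials, k))
    return float(np.mean(x.max(axis=1) - x.min(axis=1)))

# Spread expected from noise alone, with no group effect of any kind.
two_run = expected_range(2)    # difference between two repeated runs
six_group = expected_range(6)  # spread across six demographic groups
ratio = six_group / two_run

print(f"two-run noise difference: {two_run:.3f}")
print(f"six-group noise spread:   {six_group:.3f}")
print(f"inflation from arity:     {ratio:.2f}x")
```

An arity-matched noise floor, in this framing, means comparing the observed six-group spread against the spread of six noise-only values rather than against a two-run difference.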
Blogger's Review: AgentFairBench offers a fresh perspective on fairness assessment in LLMs, particularly in multi-domain applications. Its reproducibility and cost-effectiveness make it a crucial tool for researchers amid growing concerns about algorithmic fairness. By comparing various agent architectures, researchers can gain deeper insights into model performance and bias in real-world applications.