AI evaluations are widely used to test models and track progress, but the diversity of evaluators introduces inconsistencies that make results hard to analyze and compare. Results are often saved in incompatible formats and scattered across leaderboards, papers, blog posts, evaluation harness logs, and custom repositories. Furthermore, different evaluation frameworks produce divergent scores for nominally identical evaluations and record metadata inconsistently, which hinders comparison, cross-community evaluation science, cost reduction, and reuse.
To address these issues, we introduce Every Eval Ever, the first shared schema and community-crowdsourced repository for AI evaluation results. The schema standardizes how an evaluation is represented as a single, unified JSON document. It is source-agnostic by design: it ingests results from evaluation harnesses and papers alike, and it can optionally store per-instance outputs for fine-grained analysis.
Our contributions include: (i) a community-governed metadata schema with a companion instance-level schema, marking the first standardization effort of its kind; (ii) automatic converters from popular formats, evaluation harnesses, and leaderboards to the unified schema; and (iii) a crowdsourced community database hosted on Hugging Face, currently spanning 22,235 models, 2,273 unique benchmarks, and 31 evaluation formats.
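To make the idea of a source-agnostic record concrete, the following is a minimal Python sketch of how a converter might map two differently shaped sources onto one JSON document. It is an illustration under stated assumptions, not the Every Eval Ever schema itself: the class names, the field names (`model`, `benchmark`, `metric`, `score`, `source_type`, `source_name`, `instances`), and both input layouts are hypothetical and chosen only for this example.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class InstanceResult:
    """One per-instance output (hypothetical fields, for illustration only)."""
    instance_id: str
    output: str
    score: float


@dataclass
class EvalRecord:
    """A unified evaluation record (hypothetical fields, not the real schema)."""
    model: str
    benchmark: str
    metric: str
    score: float
    source_type: str  # assumed vocabulary: "harness" or "paper"
    source_name: str
    instances: Optional[List[InstanceResult]] = None  # optional per-instance data


def from_harness_log(log: dict, harness_name: str) -> EvalRecord:
    """Convert an assumed harness log layout into the unified record."""
    samples = log.get("samples")
    instances = None
    if samples is not None:
        instances = [
            InstanceResult(
                instance_id=str(s["id"]),
                output=s["output"],
                score=float(s["score"]),
            )
            for s in samples
        ]
    return EvalRecord(
        model=log["model_name"],
        benchmark=log["task"],
        metric=log["metric"],
        score=float(log["value"]),
        source_type="harness",
        source_name=harness_name,
        instances=instances,
    )


def from_paper_row(row: dict, paper_title: str) -> EvalRecord:
    """Convert an assumed paper-table row; papers rarely carry per-instance data."""
    return EvalRecord(
        model=row["Model"],
        benchmark=row["Benchmark"],
        metric=row["Metric"],
        score=float(row["Score"]),
        source_type="paper",
        source_name=paper_title,
    )


def to_json(record: EvalRecord) -> str:
    """Serialize a record, including nested instances, as one JSON document."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


if __name__ == "__main__":
    harness_log = {
        "model_name": "example-model",
        "task": "example-benchmark",
        "metric": "accuracy",
        "value": 0.5,
        "samples": [
            {"id": 0, "output": "A", "score": 1.0},
            {"id": 1, "output": "B", "score": 0.0},
        ],
    }
    print(to_json(from_harness_log(harness_log, "example-harness")))
```

Both converters return the same record type, so downstream analysis can treat harness logs and paper tables uniformly, while the `instances` field stays empty whenever the source does not provide per-instance outputs.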