Abstract
Strategic reasoning under uncertainty is crucial for consequential decisions in negotiation, finance, and policy, yet prevailing game-play benchmarks collapse heterogeneous reasoning dimensions into a single scalar, leaving the capability structure of frontier LLMs unexamined.
We introduce Poker Arena, a no-limit Texas Hold'em tournament platform that couples a three-layer memory architecture (within-hand, session, and cross-session) with a nine-axis cognitive profile, decomposing strategic reasoning into interpretable dimensions such as bet-sizing calibration and positional awareness.
We evaluate seven frontier models across 50 sessions of 1,000 hands and in a controlled memory ablation. Tournament chips and aggregate axis score order the field differently: Claude Opus 4.6 nets +$15,730 in chips with 14 first-place finishes, yet ranks only fifth of seven on mean axis score. Persistent memory, meanwhile, helps some models and hurts others.
These findings show that multi-axis evaluation surfaces capability structure that scalar leaderboards systematically misrank, with cross-dimensional consistency outweighing peak performance on any single axis.
Blogger's Review: Poker Arena offers a fresh way to evaluate the strategic reasoning of LLMs. Its multi-axis evaluation gives a more accurate picture of each model's capability structure than a single score, and its memory ablation shows that memory matters in complex decision-making, though not uniformly: it helps some models and hurts others. These results offer useful guidance for future AI model development, especially for decision-making under uncertainty across multiple dimensions.
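To make the abstract's central claim concrete, here is a minimal sketch of how a scalar outcome (chips) and a mean axis score can order the same field differently. Everything in it is an assumption for illustration: the model names, the chip totals, the axis values, and the use of three axes instead of the paper's nine are invented, and the helper `rank_by` is hypothetical rather than part of Poker Arena.

```python
from statistics import mean

# Hypothetical toy data: names, chip totals, and axis scores are invented
# for illustration and are NOT results from the paper. Three axes are used
# here for brevity; the paper's profile has nine.
results = {
    "model_a": {"chips": 1200, "axes": [0.55, 0.60, 0.58]},
    "model_b": {"chips": -300, "axes": [0.90, 0.40, 0.85]},
    "model_c": {"chips": 400, "axes": [0.70, 0.72, 0.69]},
}


def rank_by(scores):
    """Return {name: rank}, where rank 1 is the highest score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position + 1 for position, name in enumerate(ordered)}


# Rank the same field two ways: by chips won and by mean axis score.
chip_rank = rank_by({name: r["chips"] for name, r in results.items()})
axis_rank = rank_by({name: mean(r["axes"]) for name, r in results.items()})

for name in results:
    print(f"{name}: chip rank {chip_rank[name]}, axis rank {axis_rank[name]}")
```

In this toy field the chip leader has the lowest mean axis score, the same kind of disagreement between the two orderings that the abstract reports.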