[CS.AI] Limited Marginal Benefit of Reasoning-Heavy Models in ESG Scoring

The automated scoring of ESG narrative disclosures using large language models (LLMs) is gaining traction. However, whether reasoning-heavy frontier models provide value commensurate with their cost remains empirically unsettled. We evaluate this question on a corpus of ten Japanese listed firms across three rubric axes—quantitative targets, progress-tracking infrastructure, and external-standard alignment—using a four-model consensus design that combines a reasoning-on frontier model with three reasoning-off contemporaries.

Across 120 firm × axis × model scores, the pooled mean absolute deviation between the reasoning-on model and each reasoning-off counterpart is 0.38 on a 5-point scale; only 2% of pairwise comparisons reach a two-point deviation, and none exceeds two points. Per-firm cost accounting shows the reasoning-on arm alone costs roughly 5.6× as much as the three-provider reasoning-off ensemble, for outcomes that differ only within small margins. We conclude that in span-based ESG narrative scoring, reasoning-heavy deployment does not materially improve outcomes relative to reasoning-off consensus, while substantially increasing operational cost. We discuss implications for cost-effective ESG auto-scoring pipelines and LLM deployment governance in applied accountability settings. An earlier version of this work is available on SSRN (Abstract ID 6683303).
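To make the headline statistic concrete, here is a minimal Python sketch of how a pooled mean absolute deviation of this kind could be computed. It is an illustration under stated assumptions, not the authors' code: the function name `pairwise_deviation_summary`, the model labels (`on`, `off_1`, ...), and the toy scores are all hypothetical, and the abstract does not spell out the exact pooling procedure. One plausible reading is that each firm × axis cell contributes one comparison per reasoning-off model, which would give 10 × 3 × 3 = 90 comparisons for the corpus described above.

```python
def pairwise_deviation_summary(scores, reasoning_on):
    """Summarise |reasoning-on score - reasoning-off score| over all cells.

    scores: dict mapping (firm, axis) -> {model_name: score on a 1-5 scale}.
    reasoning_on: key of the reasoning-on model inside each inner dict.
    Returns (pooled MAD, share of deviations >= 2, max deviation, n comparisons).
    """
    deviations = []
    for cell in sorted(scores):
        by_model = scores[cell]
        reference = by_model[reasoning_on]
        for model, value in sorted(by_model.items()):
            if model != reasoning_on:
                deviations.append(abs(reference - value))
    n = len(deviations)
    pooled_mad = sum(deviations) / n
    share_two_plus = sum(d >= 2 for d in deviations) / n
    return pooled_mad, share_two_plus, max(deviations), n


# Hypothetical toy data (NOT from the paper): 2 firms x 3 axes x 4 models.
toy_scores = {
    ("firm_a", "targets"):   {"on": 4, "off_1": 4, "off_2": 3, "off_3": 4},
    ("firm_a", "tracking"):  {"on": 3, "off_1": 3, "off_2": 3, "off_3": 2},
    ("firm_a", "alignment"): {"on": 5, "off_1": 4, "off_2": 5, "off_3": 5},
    ("firm_b", "targets"):   {"on": 2, "off_1": 2, "off_2": 2, "off_3": 4},
    ("firm_b", "tracking"):  {"on": 3, "off_1": 4, "off_2": 3, "off_3": 3},
    ("firm_b", "alignment"): {"on": 4, "off_1": 4, "off_2": 4, "off_3": 4},
}

mad, share_two_plus, max_dev, n_comparisons = pairwise_deviation_summary(
    toy_scores, reasoning_on="on"
)
print(f"comparisons={n_comparisons} MAD={mad:.2f} "
      f"share>=2={share_two_plus:.1%} max={max_dev}")
```

Pooling over every comparison, rather than averaging per-firm means, weights each firm × axis × model pair equally. Whether the paper pools this way is an assumption of the sketch.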

Blogger's Review: This study finds that reasoning-heavy models add little marginal benefit in ESG scoring, and it highlights the mismatch between their operational cost and the benefit they actually deliver. It is a reminder to select models carefully, particularly when resources are constrained.
