Large Language Models (LLMs) exhibit strong performance on reasoning tasks, yet it remains unclear whether this reflects faithful logical inference or merely heuristic approximation. We investigate this issue in legal entailment by comparing three paradigms: pure LLM classification, LLM-based Formal Reasoning, and solver-based Formal Reasoning using the Z3 SMT solver, on a re-annotated subset of ContractNLI across five LLMs. Our re-annotation reveals a systematic and measurable gap between pragmatic legal interpretation and strict formal entailment, where a significant proportion of legally sound inferences lack formal grounding without additional unstated assumptions. While introducing formal structure improves accuracy—with LLM-based Formal Reasoning achieving the highest benchmark performance—we show that this gain does not imply faithful reasoning. We identify three recurring failure modes:
- Scope Laundering: LLMs report solver-inconsistent classifications without executing the underlying formal reasoning, leading to conclusions that seem logically grounded but are not;
- Implicit Constraint Blindness: LLMs overlook logical constraints present in formal representations;
- Program Synthesis Failures: LLMs generate incorrect Z3 code despite structured prompting.
Critically, scope laundering persists across all models, raising serious concerns about the faithfulness of LLM-based formal reasoning as a proxy for symbolic execution. These results reveal a fundamental gap between benchmark accuracy and logical faithfulness.
Blogger's Review: This paper offers an insightful examination of the limitations of LLMs in legal reasoning applications, particularly the tension between the accuracy of formal reasoning and its logical soundness. Addressing how to ensure that model outputs are both accurate and logically valid is a crucial direction for future research.