Large language models (LLMs) are increasingly proposed as educational tutors, yet stronger task-solving ability does not necessarily imply better learning support. Motivated by recent calls to measure the social impact of NLP systems, we explore whether public LLM tutoring benchmarks can distinguish learning-supportive behavior from mere answer production. We propose a lightweight diagnostic based on the gap between solving-oriented and pedagogy-oriented benchmark performance.
Using public MathTutorBench leaderboard results, we find that these dimensions are only partially aligned: the correlation between solving and pedagogy composites is 0.421, and several models shift meaningfully in rank when evaluation moves from solving to pedagogy. We then analyze the public TutorBench sample and show that agency-relevant behaviors are explicitly encoded in benchmark rubrics, especially in active-learning settings that reward guiding questions, calibrated hints, and non-disclosive scaffolding.
Together, these findings suggest that educational impact evaluation should not treat task success as a sufficient proxy for learning support. We argue that public tutoring benchmarks can better support positive-impact evaluation by reporting solving-oriented and pedagogy-oriented scores separately and by making disclosure-sensitive, student-agency-preserving criteria more explicit.
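The diagnostic described above can be illustrated with a small sketch. It assumes each model already has one solving composite and one pedagogy composite, then computes the Pearson correlation between the two and a per-model rank shift. The model names, the scores, and the helper names (`pearson`, `ranks`, `diagnostic`) are all hypothetical and are not taken from MathTutorBench. The abstract does not specify how the paper builds its composites or which correlation coefficient it reports, so this is one plausible operationalization rather than the authors' procedure.

```python
from statistics import mean

# Toy composites invented for illustration only; NOT MathTutorBench results.
toy_scores = {
    "model_a": {"solving": 0.90, "pedagogy": 0.55},
    "model_b": {"solving": 0.75, "pedagogy": 0.70},
    "model_c": {"solving": 0.60, "pedagogy": 0.65},
    "model_d": {"solving": 0.40, "pedagogy": 0.30},
}


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def ranks(score_by_model):
    """Rank models from best (1) to worst; assumes no tied scores."""
    ordered = sorted(score_by_model, key=score_by_model.get, reverse=True)
    return {model: position + 1 for position, model in enumerate(ordered)}


def diagnostic(scores):
    """Return (correlation, rank shift per model).

    Rank shift = solving rank - pedagogy rank, so a negative value means
    the model ranks worse once evaluation moves from solving to pedagogy.
    """
    models = list(scores)
    solving = {m: scores[m]["solving"] for m in models}
    pedagogy = {m: scores[m]["pedagogy"] for m in models}
    r = pearson([solving[m] for m in models], [pedagogy[m] for m in models])
    solving_rank, pedagogy_rank = ranks(solving), ranks(pedagogy)
    shift = {m: solving_rank[m] - pedagogy_rank[m] for m in models}
    return r, shift


r, shift = diagnostic(toy_scores)
print(f"solving/pedagogy correlation: {r:.3f}")
for model, delta in shift.items():
    print(f"{model}: rank shift {delta:+d}")
```

In this toy data the strongest solver drops two places under the pedagogy ranking, which is the kind of divergence a single combined score would hide.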
Blogger's Review: This paper offers a useful way to evaluate LLMs in education by separating genuine learning support from simple answer generation. It argues that assessments of educational technology should report task-solving and teaching support separately, rather than treating one as a proxy for the other.