Abstract
LLM agents increasingly maintain long-term memory of user facts across sessions, yet current evaluation methods typically aggregate accuracy over question rows or episodes. Because each question row is scored independently, these methods cannot show how a fact behaves as conditions change, even when several questions probe the same fact.
We introduce MemTrace, a benchmark whose unit of measurement is the knowledge point: a single typed fact about the user, rather than an individual question. MemTrace probes each fact along three controlled dimensions:
- Memory age, defined by how many sessions ago the fact appeared in the history;
- Question type, covering current state, earlier state, and trajectory of change;
- Evidence condition, covering present, missing, and contradicted-by-false-premise settings.
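To make the three dimensions concrete, the following is a minimal sketch of how probes for one knowledge point might be represented. The class names, the field names, the integer encoding of memory age, and the full cross of all three dimensions are illustrative assumptions; the abstract does not specify MemTrace's actual schema or sampling scheme.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class QuestionType(Enum):
    CURRENT_STATE = "current_state"
    EARLIER_STATE = "earlier_state"
    TRAJECTORY = "trajectory"


class EvidenceCondition(Enum):
    PRESENT = "present"
    MISSING = "missing"
    FALSE_PREMISE = "false_premise"


@dataclass(frozen=True)
class Probe:
    """One question about one knowledge point (hypothetical schema)."""
    knowledge_point_id: str
    memory_age: int  # assumed encoding: sessions since the fact appeared
    question_type: QuestionType
    evidence: EvidenceCondition


def build_probes(knowledge_point_id, memory_ages):
    """Cross every memory age with every question type and evidence condition."""
    return [
        Probe(knowledge_point_id, age, qtype, cond)
        for age, qtype, cond in product(memory_ages, QuestionType, EvidenceCondition)
    ]


# Hypothetical knowledge point id and memory ages, chosen only for illustration.
probes = build_probes("kp-001", [1, 5, 20])
```

Grouping probes by `knowledge_point_id` is what would let results be read per fact rather than per question row.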
Evaluating 13 memory-system configurations across four paradigms, we find that similar pooled accuracy hides different failures: recovering a fact's current and earlier states does not imply tracking how it changed, and safe abstention does not imply correcting a false premise. The dominant bottleneck is evidence use, not retrieval: when systems fail, the evidence was retrievable 10 times more often than it was missing. These results suggest that improving long-term memory requires better use of reachable evidence, not simply more storage or retrieval.
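The contrast between pooled accuracy and a per-dimension breakdown can be illustrated with a small sketch. The row format, the function names, and the toy data are assumptions made up for illustration; they are not MemTrace's scoring code or its reported results.

```python
from collections import defaultdict


def make_rows(outcomes):
    """Build rows from (question_type, correct, evidence_retrievable) tuples."""
    return [
        {"question_type": q, "correct": c, "evidence_retrievable": e}
        for q, c, e in outcomes
    ]


def pooled_accuracy(rows):
    """Fraction of rows answered correctly, ignoring all dimensions."""
    return sum(r["correct"] for r in rows) / len(rows)


def accuracy_by(rows, key):
    """Accuracy broken down by the value of one dimension."""
    totals = defaultdict(lambda: [0, 0])  # value -> [correct count, total count]
    for r in rows:
        totals[r[key]][0] += int(r["correct"])
        totals[r[key]][1] += 1
    return {value: c / n for value, (c, n) in totals.items()}


def failure_attribution(rows):
    """Split failures by whether the needed evidence was retrievable."""
    failures = [r for r in rows if not r["correct"]]
    retrievable = sum(r["evidence_retrievable"] for r in failures)
    return {"retrievable": retrievable, "missing": len(failures) - retrievable}


# Toy data (invented): two systems with the same pooled accuracy.
system_a = make_rows([
    ("current", True, True), ("current", True, True),
    ("earlier", True, True), ("earlier", True, True),
    ("trajectory", False, True), ("trajectory", False, True),
])
system_b = make_rows([
    ("current", True, True), ("current", False, False),
    ("earlier", True, True), ("earlier", False, True),
    ("trajectory", True, True), ("trajectory", True, True),
])

profile_a = accuracy_by(system_a, "question_type")
profile_b = accuracy_by(system_b, "question_type")
```

In this toy case both systems answer four of six questions correctly, yet `profile_a` shows every trajectory question failing while `profile_b` shows none failing, and `failure_attribution` separates failures where the evidence was reachable from those where it was not.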
Blogger's Review: MemTrace offers a fresh perspective on evaluating LLM long-term memory by measuring behavior per fact and by highlighting evidence use, rather than retrieval, as the main bottleneck. Future research can build on this benchmark to explore strategies for improving memory systems.