NeFut

[CS.AI] MemTrace: Unveiling What Final Accuracy Misses in Long-Term Memory

Published at: 2026-06-17 22:00
Last updated: 2026-06-20 13:45
#AI #Machine Learning #LLM

Abstract

As LLM agents increasingly maintain long-term memory of user facts across sessions, current evaluation methods typically aggregate accuracy over question rows or episodes. Because each question row is scored independently, these methods cannot show how a single fact behaves as conditions change, even when several questions probe the same fact.

We introduce MemTrace, a benchmark whose unit of measurement is the knowledge point: a single typed fact about the user, rather than an individual question. MemTrace probes each fact along three controlled dimensions:

  1. Memory age, defined by how many sessions ago the fact appeared in the history;
  2. Question type, covering current state, earlier state, and trajectory of change;
  3. Evidence condition, covering present, missing, and contradicted-by-false-premise settings.
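
To make the "knowledge point as unit" idea concrete, here is a minimal sketch of how probes along these three dimensions might be recorded and broken down. It is not MemTrace's actual schema or scoring code: the `Probe` class, its field names, the label strings, and the toy outcomes are all assumptions invented for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    """One question about one knowledge point (hypothetical field names)."""
    fact_id: str        # the knowledge point this question probes
    memory_age: int     # sessions since the fact appeared in the history
    question_type: str  # "current", "earlier", or "trajectory"
    evidence: str       # "present", "missing", or "false_premise"
    correct: bool       # whether the answer was judged correct


def pooled_accuracy(probes):
    """Row-level accuracy: every question counts independently."""
    return sum(p.correct for p in probes) / len(probes)


def accuracy_by(probes, field):
    """Accuracy grouped by one probe dimension, e.g. 'question_type'."""
    hits, totals = defaultdict(int), defaultdict(int)
    for p in probes:
        key = getattr(p, field)
        totals[key] += 1
        hits[key] += p.correct
    return {key: hits[key] / totals[key] for key in totals}


def make(outcomes):
    """Build toy probes; age and evidence condition are held fixed here."""
    return [Probe(f, 1, q, "present", ok) for f, q, ok in outcomes]


# Invented toy outcomes for two imaginary systems, two facts each.
system_a = make([
    ("f1", "current", True), ("f1", "earlier", True), ("f1", "trajectory", False),
    ("f2", "current", True), ("f2", "earlier", True), ("f2", "trajectory", False),
])
system_b = make([
    ("f1", "current", True), ("f1", "earlier", False), ("f1", "trajectory", True),
    ("f2", "current", False), ("f2", "earlier", True), ("f2", "trajectory", True),
])

by_type_a = accuracy_by(system_a, "question_type")
by_type_b = accuracy_by(system_b, "question_type")
```

In this toy data both systems answer 4 of 6 questions correctly, so their pooled accuracy is identical, yet the first never gets a trajectory question right while the second always does. Grouping by `fact_id`, `memory_age`, or `evidence` works the same way.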

Evaluating 13 memory-system configurations across four paradigms, we find that similar pooled accuracy hides different failures: recovering a fact's current and earlier states does not imply tracking how it changed, and safe abstention does not imply correcting a false premise. The dominant bottleneck is evidence use, not retrieval: when systems fail, the evidence was retrievable 10 times more often than it was missing. These results suggest that improving long-term memory requires better use of reachable evidence, not simply more storage or retrieval.
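
The retrieval-versus-evidence-use distinction can be illustrated with a small failure-attribution sketch. This is only one plausible way to do the bookkeeping, not the paper's procedure: it assumes each evaluated question carries a hypothetical `evidence_reachable` flag saying whether the needed evidence was available to the system, and the records below are invented toy values, not results from the paper.

```python
from collections import Counter


def attribute_failures(records):
    """Split failed answers by whether the needed evidence was reachable.

    records: iterable of (correct, evidence_reachable) boolean pairs.
    Both fields are assumptions made for this illustration.
    """
    counts = Counter()
    for correct, evidence_reachable in records:
        if correct:
            continue  # only failures are attributed
        counts["evidence_use" if evidence_reachable else "retrieval"] += 1
    return counts


def use_to_retrieval_ratio(counts):
    """Evidence-use failures per retrieval failure; None if undefined."""
    if counts["retrieval"] == 0:
        return None
    return counts["evidence_use"] / counts["retrieval"]


# Invented toy records: one correct answer and four failures.
toy_records = [
    (True, True),
    (False, True),
    (False, True),
    (False, False),
    (False, True),
]

toy_counts = attribute_failures(toy_records)
toy_ratio = use_to_retrieval_ratio(toy_counts)
```

Here three failures happened with the evidence reachable and one with it missing, so the toy ratio is 3.0. A ratio well above 1 is the pattern the abstract describes: failures dominated by poor use of evidence that was within reach.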

Blogger's Review: MemTrace offers a fresh perspective on evaluating LLM long-term memory by shifting attention from retrieval to how well systems use the evidence they can already reach. Future research can build on this benchmark to explore strategies for improving memory systems.

Original Source: https://arxiv.org/abs/2606.17328
