Introduction
Memory is central to how personal agents turn stored information and prior interactions into future-oriented assistance. Useful cues arise both from what the agent observes and from how users interact with it, so the agent must carry these cues from the current request to similar future tasks.
Existing Issues
Current memory benchmarks typically test dialogue recall or task improvement in isolation, largely neglecting the trajectory from streaming observations to subsequent assistance.
Introducing StreamMemBench
We introduce StreamMemBench, a streaming benchmark that constructs a two-step task sequence around each evidence anchor from EgoLife egocentric streams. The initial task assesses evidence use, while the follow-up task evaluates the reuse of feedback and interaction experiences.
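To make the two-step structure concrete, here is a minimal Python sketch of how one task pair around an evidence anchor could be represented. The class names, field names, and example values are all hypothetical illustrations and are not taken from the StreamMemBench release.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvidenceAnchor:
    """A moment in the egocentric stream that later tasks depend on (hypothetical schema)."""
    stream_id: str
    timestamp_s: float
    description: str


@dataclass
class Task:
    """One request to the agent, optionally followed by user feedback."""
    prompt: str
    feedback: Optional[str] = None


@dataclass
class TaskPair:
    """Two-step sequence built around a single anchor."""
    anchor: EvidenceAnchor
    initial: Task    # checks whether the agent uses the observed evidence
    follow_up: Task  # checks whether feedback from `initial` is reused


def build_pair(anchor: EvidenceAnchor, initial_prompt: str,
               feedback: str, follow_up_prompt: str) -> TaskPair:
    """Attach feedback to the initial task only; the follow-up must rely on memory."""
    return TaskPair(
        anchor=anchor,
        initial=Task(prompt=initial_prompt, feedback=feedback),
        follow_up=Task(prompt=follow_up_prompt),
    )


# Illustrative values only, not real benchmark data.
pair = build_pair(
    EvidenceAnchor("example_stream", 125.0, "User puts the keys in the blue bowl"),
    initial_prompt="Where did I leave my keys?",
    feedback="Just give me the location, no extra detail.",
    follow_up_prompt="Where did I leave my wallet?",
)
```

In this sketch the follow-up task carries no feedback of its own, which mirrors the idea that the agent has to bring forward what it learned during the initial task.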
Evaluation Metrics
Four metrics are used to diagnose evidence recall, initial evidence use, feedback incorporation, and follow-up reuse.
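The summary does not give the metric definitions, so the following is only a rough sketch of how four such diagnostics could be aggregated. It assumes each anchor yields a binary pass/fail per stage and that scores are plain averages; both assumptions, along with every name and toy value below, are hypothetical rather than the benchmark's actual scoring.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AnchorOutcome:
    """Per-anchor pass/fail flags (assumed binary; the real metrics may differ)."""
    evidence_recalled: bool      # was the evidence retrievable from memory?
    evidence_used: bool          # did the initial answer actually use it?
    feedback_incorporated: bool  # was feedback applied within the initial task?
    follow_up_reused: bool       # did the follow-up task reuse that feedback?


def summarize(outcomes: List[AnchorOutcome]) -> Dict[str, float]:
    """Average each flag over all anchors to get four rates in [0, 1]."""
    n = len(outcomes)
    if n == 0:
        raise ValueError("no outcomes to summarize")
    return {
        "evidence_recall": sum(o.evidence_recalled for o in outcomes) / n,
        "initial_evidence_use": sum(o.evidence_used for o in outcomes) / n,
        "feedback_incorporation": sum(o.feedback_incorporated for o in outcomes) / n,
        "follow_up_reuse": sum(o.follow_up_reused for o in outcomes) / n,
    }


# Toy outcomes for illustration, not experimental results.
toy = [
    AnchorOutcome(True, True, True, True),
    AnchorOutcome(True, False, True, False),
    AnchorOutcome(True, True, False, False),
    AnchorOutcome(False, False, True, False),
]
scores = summarize(toy)
```

Reporting the four rates side by side is what makes gaps visible: in the toy data, recall is higher than initial use, and feedback incorporation is higher than follow-up reuse.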
Experimental Results
Experiments involving eight memory systems across two backbones indicate that current systems often fail to effectively utilize observed evidence or translate feedback into reliable follow-up behaviors, even when evidence is stored or feedback is locally incorporated.
StreamMemBench is publicly available on GitHub.
Blogger's Review: StreamMemBench addresses a critical gap in streaming memory evaluation. By systematically testing whether observed evidence and user feedback carry over into subsequent tasks, it provides a useful reference framework for developing future intelligent assistants. Its assessment of existing systems also exposes their shortcomings, which should drive deeper research in the field.