[CS.AI] StreamMemBench: Streaming Evaluation for Future-Oriented Assistance

Introduction

A personal agent's memory plays a central role in turning stored information and prior interactions into future-oriented assistance. Useful cues arise both from what the agent observes and from how users interact with it, so the agent must carry these cues from the current request to similar future tasks.

Existing Issues

Current memory benchmarks typically test dialogue recall or task improvement in isolation, largely neglecting the trajectory from streaming observations to subsequent assistance.

Introducing StreamMemBench

We introduce StreamMemBench, a streaming benchmark that constructs a two-step task sequence around each evidence anchor from EgoLife egocentric streams. The initial task assesses evidence use, while the follow-up task evaluates the reuse of feedback and interaction experiences.
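To make the two-step structure concrete, here is a minimal Python sketch of one evidence anchor and its task sequence. It is an illustration only, not the benchmark's actual data format or API: the names `AnchorEpisode`, `MemoryAgent`, `run_episode`, and `EchoAgent` are hypothetical, and treating feedback as a fixed string delivered between the two tasks is an assumption.

```python
from dataclasses import dataclass
from typing import List, Protocol


class MemoryAgent(Protocol):
    """Hypothetical interface for a memory system under test."""

    def observe(self, text: str) -> None: ...
    def answer(self, prompt: str) -> str: ...


@dataclass
class AnchorEpisode:
    """One evidence anchor with its two-step task sequence (assumed layout)."""
    anchor_id: str
    evidence: str          # what the agent observes in the stream
    initial_prompt: str    # step 1: should draw on the evidence
    feedback: str          # user feedback given after step 1 (assumed fixed)
    follow_up_prompt: str  # step 2: should reuse feedback and experience


@dataclass
class EpisodeResult:
    anchor_id: str
    initial_response: str
    follow_up_response: str


def run_episode(agent: MemoryAgent, episode: AnchorEpisode) -> EpisodeResult:
    # The agent sees the evidence before the initial task is posed.
    agent.observe(episode.evidence)
    initial = agent.answer(episode.initial_prompt)
    # Feedback arrives only after the initial response, then the follow-up.
    agent.observe(episode.feedback)
    follow_up = agent.answer(episode.follow_up_prompt)
    return EpisodeResult(episode.anchor_id, initial, follow_up)


class EchoAgent:
    """Toy agent that stores everything and echoes its memory back."""

    def __init__(self) -> None:
        self.memory: List[str] = []

    def observe(self, text: str) -> None:
        self.memory.append(text)

    def answer(self, prompt: str) -> str:
        return f"{prompt} | memory: {'; '.join(self.memory)}"
```

The ordering is the point of the sketch: the follow-up response can only reflect the feedback if the memory system retained it after the initial task.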

Evaluation Metrics

Four metrics are used to diagnose evidence recall, initial evidence use, feedback incorporation, and follow-up reuse.
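The summary above names the four diagnostic targets but not their formulas. As a rough sketch, assuming each anchor yields one binary judgment per target, the four metrics could be aggregated as simple rates. The `AnchorOutcome` fields, the `metric_rates` function, and the binary scoring are all hypothetical and may differ from how the paper defines its metrics.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class AnchorOutcome:
    """Assumed per-anchor binary judgments, one per diagnostic target."""
    evidence_recalled: bool      # was the evidence retrievable from memory?
    evidence_used: bool          # did the initial response use it?
    feedback_incorporated: bool  # was the feedback taken up locally?
    follow_up_reused: bool       # did the follow-up response reuse it?


def metric_rates(outcomes: Iterable[AnchorOutcome]) -> Dict[str, float]:
    """Return the fraction of anchors that pass each diagnostic."""
    items = list(outcomes)
    if not items:
        raise ValueError("need at least one anchor outcome")
    n = len(items)
    return {
        "evidence_recall": sum(o.evidence_recalled for o in items) / n,
        "initial_evidence_use": sum(o.evidence_used for o in items) / n,
        "feedback_incorporation": sum(o.feedback_incorporated for o in items) / n,
        "follow_up_reuse": sum(o.follow_up_reused for o in items) / n,
    }
```

Reporting the four rates separately is what makes the evaluation diagnostic: a gap between evidence recall and initial evidence use, for example, would point to evidence that is stored but not used.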

Experimental Results

Experiments involving eight memory systems across two backbones indicate that current systems often fail to effectively utilize observed evidence or translate feedback into reliable follow-up behaviors, even when evidence is stored or feedback is locally incorporated.

StreamMemBench is publicly available on GitHub.

Blogger's Review: StreamMemBench addresses a critical gap in streaming memory evaluation. By systematically testing whether observed evidence and feedback carry over into subsequent tasks, it provides a useful reference framework for developing future intelligent assistants. Its assessment of existing systems also exposes their shortcomings, which should motivate deeper research in the field.
