[CS.AI] Revolutionizing Prefix Caching: Editable and Composable KV Cache

Prefix caching reuses prefill computation only across an exactly shared prefix, so a change to any field invalidates the entire downstream cache. Overwriting only the field's own key/value vectors while reusing the rest does not help: the model keeps acting on the old value. This effect has been causally established across four model families: at prefill, the model has already written the field-conditioned conclusion onto downstream notes, and the field's own key/value vectors drive less than 1% of the decision.
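The exact-prefix constraint can be illustrated with a minimal sketch. It assumes a block-based cache keyed by chained hashes, a common design in serving systems; the helper names `block_hashes` and `reusable_blocks` and the block size of 4 are hypothetical and chosen only for illustration.

```python
import hashlib

def block_hashes(tokens, block_size=4):
    """Chain-hash full blocks: each hash covers its block and all earlier blocks."""
    hashes, prev = [], b""
    full = len(tokens) - len(tokens) % block_size  # ignore a trailing partial block
    for i in range(0, full, block_size):
        block = ",".join(map(str, tokens[i:i + block_size])).encode()
        prev = hashlib.sha256(prev + block).digest()
        hashes.append(prev)
    return hashes

def reusable_blocks(cached, new):
    """Count leading blocks whose chained hashes match, i.e. the reusable prefix."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n

original = list(range(16))   # stand-in token ids (assumption)
edited = list(original)
edited[5] = 999              # change one "field" token inside the second block

cached = block_hashes(original)
hit = reusable_blocks(cached, block_hashes(edited))
print(f"reusable blocks after in-place edit: {hit} of {len(cached)}")
```

Because each hash folds in its predecessor, a single changed token makes every later block miss, even though the later tokens are identical.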

Reading the cache as a notebook of memoized conclusions yields two capabilities: (1) Editability. A salient erratum amends the notes; with chain-of-thought (CoT), the edit alone recovers the decision (1.00 at 8B, ~1% compute), whereas without CoT the model ignores it. (2) Composability. The notes are position-portable, so a precompiled skill can be RoPE-repositioned and spliced into any context, indistinguishable from full recompute (logit cosine similarity from 0.90 to 0.999, validated across twelve models) at O(L) instead of O(L^2) time-to-first-token. A unified edit+compose agent remains decision-identical to recompute at up to 14.9x lower latency.
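The RoPE repositioning step rests on the fact that rotations compose additively: a key already rotated for position p can be moved to position p + delta by applying a further rotation of delta, without access to the unrotated key. The sketch below shows only this identity, in pure Python. It assumes the interleaved-pair RoPE convention and a base of 10000; the function names `rope` and `reposition` are hypothetical, and this is not the authors' implementation.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each consecutive pair of an even-length vector by pos * frequency."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # frequency of pair i/2 is base^(-i/d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def reposition(cached_key, delta, base=10000.0):
    """Shift an already-rotated key by delta positions (rotations add)."""
    return rope(cached_key, delta, base)

raw_key = [0.3, -1.2, 0.7, 0.5, -0.4, 0.9, 1.1, -0.6]  # toy key vector (assumption)
cached = rope(raw_key, 5)        # key as stored in a cache built at position 5
moved = reposition(cached, 37)   # splice it into a context at position 42
fresh = rope(raw_key, 42)        # what full recompute would store at position 42

max_err = max(abs(a - b) for a, b in zip(moved, fresh))
print(f"max abs difference vs. recompute: {max_err:.2e}")
```

Only keys need this treatment, since standard RoPE rotates queries and keys but leaves values unrotated. The identity makes the key geometry exact; whether the cached content remains valid in the new context is the empirical claim the abstract reports.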

This approach applies to any per-token attention KV cache; it is validated across scale, quantization, Mixture-of-Experts, and multimodal caches, and extends to various attention variants through small adapters. Because the erratum is append-only, it composes with production prefix caching: in an online vLLM benchmark, it keeps the prefix cache aligned (98.5% hit rate) and reduces p90 time-to-first-token by 53-398x.
