Fast KV Compaction via Attention Matching
Summary
The paper introduces a fast context-compaction method for long-context language models: Attention Matching constructs a compact KV cache whose attention outputs closely reproduce those of the full cache. The method decomposes this objective into tractable subproblems and reports up to 50x compression with minimal quality loss on selected datasets, offering a practical route to lower inference cost in deployment.
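The core idea of matching attention outputs can be sketched as follows. This is a minimal illustration, not the paper's algorithm: the attention function, the relative-error objective, and the subset-based initialization are all assumptions for demonstration; the paper would optimize the compact cache (here written K_c, V_c) against such an objective rather than sample it randomly.

```python
import numpy as np

def attention(Q, K, V):
    """Standard scaled dot-product attention."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def matching_error(Q, K, V, K_c, V_c):
    """Attention-matching objective: relative error between the full
    cache's attention outputs and the compact cache's outputs for Q."""
    full = attention(Q, K, V)
    compact = attention(Q, K_c, V_c)
    return np.linalg.norm(full - compact) / np.linalg.norm(full)

rng = np.random.default_rng(0)
n, m, d = 512, 16, 64          # full cache length, compact length, head dim
Q = rng.standard_normal((32, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))

# Hypothetical baseline: seed the compact cache with a random subset of
# the full cache; a compaction method would instead choose or optimize
# (K_c, V_c) to drive this error down at a fixed budget m << n.
idx = rng.choice(n, size=m, replace=False)
print(f"relative matching error: {matching_error(Q, K, V, K[idx], V[idx]):.3f}")
```

The compression ratio in this sketch is n/m (here 32x); the 50x figure reported in the paper refers to its own method and benchmarks, not to this toy setup.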