Sequential KV Cache Compression via Probabilistic Language Tries: Beyond the Per-Vector Shannon Limit
Summary
This arXiv paper proposes compressing the transformer KV cache as a sequence rather than one vector at a time, using probabilistic language tries and predictive delta coding. It argues that sequence-level compression can beat the entropy limit that applies when each vector is coded independently. The proposed two-layer architecture comes with tight entropy bounds and very large theoretical compression ratios, and it remains compatible with existing per-vector quantization methods. The work has implications for the efficiency and cost of LLM inference deployments.
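
The intuition behind predictive delta coding can be shown with a toy sketch. This is not the paper's method: there is no language trie here, the "predictor" is simply the previous value, and the data is a synthetic bounded random walk standing in for one quantized cache coordinate across token positions. All function names are hypothetical. The sketch only shows that residuals against a predictor can have much lower empirical entropy than the raw values coded independently.

```python
import math
import random
from collections import Counter


def empirical_entropy(symbols):
    """Shannon entropy (bits per symbol) of the empirical distribution."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def make_correlated_sequence(n, seed=0):
    """Assumed toy data: a random walk clamped to 0..255, so that
    neighbouring values are strongly correlated."""
    rng = random.Random(seed)
    x = 128
    out = []
    for _ in range(n):
        x = min(255, max(0, x + rng.choice([-1, 0, 1])))
        out.append(x)
    return out


def delta_encode(seq):
    """Predict each value as the previous one; keep the first value
    followed by the prediction residuals."""
    return [seq[0]] + [b - a for a, b in zip(seq, seq[1:])]


def delta_decode(deltas):
    """Invert delta_encode by accumulating the residuals."""
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out


seq = make_correlated_sequence(10_000)
deltas = delta_encode(seq)

# The transform is lossless: decoding recovers the original sequence.
assert delta_decode(deltas) == seq

# Entropy when every value is coded independently of its neighbours.
raw_bits = empirical_entropy(seq)
# Entropy of the residuals once the previous value is used as a predictor.
residual_bits = empirical_entropy(deltas[1:])

print(f"independent coding: {raw_bits:.2f} bits/symbol")
print(f"delta coding:       {residual_bits:.2f} bits/symbol")
```

Because the residuals in this toy only take values in {-1, 0, 1}, their entropy cannot exceed log2(3) bits per symbol, while the raw values spread over many levels. A stronger predictor, such as the language-model-based one the paper describes, would play the role of the previous-value predictor used here.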