From 300KB to 69KB per Token: How LLM Architectures Solve the KV Cache Problem
Summary
The article traces how the KV cache memory required per token has changed across LLM architectures, from GPT-2's 300 KiB per token down to the smaller footprints achieved by grouped-query and latent attention schemes. It also discusses memory eviction, prompts, external memory scaffolding, and approaches such as learned compaction, highlighting the economic and architectural tradeoffs of sustaining longer conversations.
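The per-token figures follow from simple arithmetic, which the sketch below tries to make concrete. It is an illustration under stated assumptions, not the article's own calculation: the function names are hypothetical, values are assumed to be stored in 16-bit precision, the 300 KiB figure is assumed to correspond to a GPT-2 XL-sized model (48 layers, 25 heads of dimension 64), the grouped-query configuration is purely illustrative, and the latent-attention configuration (61 layers, a 512-dimensional compressed latent plus a 64-dimensional positional key) is an assumed DeepSeek-V3-like setup.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes cached per token when each layer stores one key and one value
    vector per KV head (multi-head or grouped-query attention)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def latent_bytes_per_token(n_layers, latent_dim, rope_dim, bytes_per_elem=2):
    """Bytes cached per token when each layer stores one compressed latent
    vector plus a small positional key, instead of full per-head K and V."""
    return n_layers * (latent_dim + rope_dim) * bytes_per_elem


KIB = 1024

# Assumption: GPT-2 XL-sized model, every query head has its own K/V head.
mha = kv_bytes_per_token(n_layers=48, n_kv_heads=25, head_dim=64)

# Hypothetical grouped-query model: many query heads share 8 K/V heads.
gqa = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)

# Assumption: DeepSeek-V3-like latent attention configuration.
mla = latent_bytes_per_token(n_layers=61, latent_dim=512, rope_dim=64)

for name, size in [("multi-head", mha), ("grouped-query", gqa), ("latent", mla)]:
    print(f"{name}: {size / KIB:.1f} KiB per token")
```

Under these assumptions the three configurations come to roughly 300, 128, and 69 KiB per token. The grouped-query saving comes from reducing `n_kv_heads`, while the latent scheme replaces the per-head term entirely with a single compressed vector per layer.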