Recent Developments in LLM Architectures: KV Sharing, mHC, and Compressed Attention

May 19, 2026 at 16:40

Quality: 8/10 Relevance: 9/10

Summary

A thorough look at recent architecture innovations in open-weight LLMs, focused on long-context efficiency. The piece covers KV sharing and cross-layer KV reuse in Gemma 4, per-layer embeddings, layer-wise attention budgeting in Laguna XS.2, compressed attention in ZAYA1-8B, and CSA/HCA in DeepSeek V4, and discusses the tradeoffs among memory, compute, and modeling capacity.

AI Research, Open Source, Open Source News
