Multi-Token Prediction (MTP) with Gemma 4
Summary
The article provides a detailed visual guide to the Gemma 4 family of large language models, covering the architecture (dense vs. Mixture of Experts), interleaved local/global attention, and efficiency tricks (GQA, K=V, p-RoPE, 2D RoPE, adaptive resizing, and soft token budgets). It also introduces the Multi-Token Prediction (MTP) mechanism and its drafter-based draft-and-verify workflow, including Target Activations, KV cache sharing, and the Efficient Embedder, all aimed at speeding up on-device inference across the models' multimodal capabilities (text, images, and audio).
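To make the drafter workflow concrete, below is a minimal, runnable Python sketch of greedy draft-and-verify decoding, the general pattern MTP drafting follows. Everything here is a hypothetical stand-in rather than the article's or Gemma's actual API: `target_step` and `drafter_step` are toy models backed by hash-based pseudo-logits, and the vocabulary size is arbitrary. In a real MTP setup the drafter would be the model's extra prediction heads and the verification pass would reuse the shared KV cache mentioned above instead of recomputing context.

```python
import numpy as np

VOCAB = 32  # toy vocabulary size (hypothetical)

def _logits(context, salt):
    """Deterministic pseudo-logits for a context; stands in for a real LM."""
    seed = hash((tuple(context), salt)) % (2**32)
    return np.random.default_rng(seed).normal(size=VOCAB)

def target_step(context):
    """Stand-in for the target model's next-token logits at one position."""
    return _logits(context, salt=0)

def drafter_step(context):
    """Stand-in for a cheap drafter: an imperfect copy of the target."""
    return _logits(context, salt=0) + 0.5 * _logits(context, salt=1)

def speculative_decode(prompt, n_new, k=4):
    """Generate n_new tokens, k drafted tokens per verification round."""
    tokens = list(prompt)
    while len(tokens) < len(prompt) + n_new:
        # 1. Drafter proposes k tokens autoregressively (cheap).
        draft, ctx = [], list(tokens)
        for _ in range(k):
            t = int(np.argmax(drafter_step(ctx)))
            draft.append(t)
            ctx.append(t)
        # 2. Target verifies the draft positions (in practice, one
        #    batched forward pass over a shared KV cache).
        ctx = list(tokens)
        for t in draft:
            best = int(np.argmax(target_step(ctx)))
            if best != t:
                tokens.append(best)  # target's correction ends the round
                break
            tokens.append(t)         # draft token accepted
            ctx.append(t)
        else:
            # All k drafts accepted; take a bonus token from the target.
            tokens.append(int(np.argmax(target_step(ctx))))
    return tokens[: len(prompt) + n_new]

print(speculative_decode(prompt=[1, 2, 3], n_new=10, k=4))
```

Because verification scores all k drafted positions in a single target pass, each round emits between one and k+1 tokens while keeping the target model's outputs exactly, which is where the inference speedup comes from.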