How LLMs Actually Work
Summary
This article is a thorough, reader-friendly tour of transformer-based LLMs. It covers tokens, embeddings, positional encoding (RoPE), attention and multi-head attention, the feed-forward network, the residual stream, normalization, and the next-token prediction loop. It also discusses the distinction between a model's architecture and its trained weights, along with practical efficiency mechanisms such as mixture-of-experts (MoE) and speculative decoding.