How LLMs Work: From Tokenization to Transformers
Summary
The article is a comprehensive, accessible overview of how modern large language models work, covering tokenization, embeddings, positional encoding (RoPE), attention (Q/K/V, softmax, causal masking), multi-head attention, feed-forward networks, residual streams, and next-token prediction. It also discusses architectural choices across models, the role of trained weights, speculative decoding, and the convergence of transformer-based designs, with notes on future directions and interpretability insights.
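As a rough illustration of the attention step named above (Q/K/V, softmax, causal masking), the following is a minimal single-head sketch in plain Python. The function names `softmax` and `causal_attention` and the toy vectors are assumptions made for this example and are not taken from the article. The sketch also leaves out the learned projection matrices, multiple heads, and RoPE that a real model would use.

```python
import math


def softmax(scores):
    # Subtract the max before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask.

    Q, K, V are lists of vectors, one vector per token position.
    (Hypothetical helper for illustration only.)
    """
    d_k = len(K[0])
    d_v = len(V[0])
    out = []
    for i, q in enumerate(Q):
        # Causal mask: position i may only attend to positions 0..i,
        # so later positions are simply never scored.
        scores = [
            sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d_k)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Output is the attention-weighted average of the visible value vectors.
        out.append(
            [sum(w * V[j][c] for j, w in enumerate(weights)) for c in range(d_v)]
        )
    return out


# Toy inputs: three token positions, two dimensions each (made-up values).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

result = causal_attention(Q, K, V)
for position, vector in enumerate(result):
    print(position, vector)
```

Because of the mask, the first position can only attend to itself, so its output equals its own value vector; later positions produce a weighted blend of the value vectors at or before them.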