Quantization from the ground up
Summary
Ngrok's "Quantization from the ground up" explains how quantization can make large language models dramatically smaller and faster, comparing float32, float16, and float8 with very low-bit formats. It covers symmetric versus asymmetric quantization, the scale and zero-point parameters, the effect of outliers, and practical benchmarks (perplexity, KL divergence, and speed) run with llama.cpp, and it includes code and commands to quantize and evaluate models locally.
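
To make the symmetric versus asymmetric distinction concrete, here is a minimal sketch in plain Python. It is not taken from the article: the function names, the 8-bit width, the signed range for the symmetric case, the unsigned range for the asymmetric case, and the sample weights are all assumptions chosen for illustration. Real implementations such as llama.cpp quantize in blocks and use more elaborate formats.

```python
def quantize_symmetric(values, bits=8):
    """Symmetric: one scale, real 0.0 always maps to integer 0."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # avoid dividing by zero
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize_symmetric(q, scale):
    return [x * scale for x in q]


def quantize_asymmetric(values, bits=8):
    """Asymmetric: a scale plus an integer zero point that shifts the range."""
    qmin, qmax = 0, 2 ** bits - 1                   # 0..255 for 8 bits
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (qmax - qmin) if hi > lo else 1.0
    zero_point = round(qmin - lo / scale)           # integer that represents lo
    q = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize_asymmetric(q, scale, zero_point):
    return [(x - zero_point) * scale for x in q]


def max_abs_error(original, restored):
    return max(abs(a - b) for a, b in zip(original, restored))


# Hypothetical, deliberately skewed weights: mostly positive, one small negative.
weights = [-0.2, 0.1, 0.5, 1.8]

q_sym, scale_sym = quantize_symmetric(weights)
err_sym = max_abs_error(weights, dequantize_symmetric(q_sym, scale_sym))

q_asym, scale_asym, zp = quantize_asymmetric(weights)
err_asym = max_abs_error(weights, dequantize_asymmetric(q_asym, scale_asym, zp))

print("symmetric  scale:", scale_sym, "max error:", err_sym)
print("asymmetric scale:", scale_asym, "zero point:", zp, "max error:", err_asym)

# An outlier stretches the range, so every other value gets coarser steps.
_, scale_outlier = quantize_symmetric(weights + [40.0])
print("symmetric scale with outlier:", scale_outlier)
```

For skewed data like this, the asymmetric scheme spends its integer range on the interval the values actually occupy, so its step size is smaller than the symmetric one. The last lines show why outliers matter: a single large value inflates the scale for everything that shares it.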