Google’s TurboQuant AI-compression can cut LLM memory usage by 6x without sacrificing quality
Summary
Google's TurboQuant compresses the key-value (KV) cache of large language models, cutting memory use by up to 6x while maintaining output quality. It quantizes the cache with PolarQuant and adds a 1-bit error-correction layer (QJL) to smooth the residuals. Together these enable 3-bit quantization without retraining, and in 4-bit operation attention runs about 8x faster on Nvidia H100 GPUs. If adopted, TurboQuant could lower costs and memory requirements, with potential benefits for edge and mobile AI, though the freed resources may instead be used to run more complex models.
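
To give a sense of scale, the following is a rough back-of-the-envelope sketch of KV cache size at different bit widths. It is not TurboQuant itself: the helper `kv_cache_bytes`, the model shape, and the 16-bit baseline are all assumptions made for illustration, and the calculation ignores any metadata a real quantizer stores alongside the values. Bit width alone gives 16/3, or about 5.3x, so the reported "up to 6x" figure presumably depends on details not covered in this summary.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bits_per_value, batch_size=1):
    """Approximate KV cache size in bytes (hypothetical helper).

    Ignores per-block metadata such as scales or error-correction bits.
    """
    # Factor of 2: one tensor for keys and one for values.
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return values * bits_per_value / 8


# Hypothetical model shape, chosen only to make the numbers round.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=32_768)

fp16 = kv_cache_bytes(bits_per_value=16, **shape)  # assumed 16-bit baseline
q3 = kv_cache_bytes(bits_per_value=3, **shape)     # 3-bit quantized cache

print(f"16-bit cache: {fp16 / 2**30:.2f} GiB")  # 4.00 GiB
print(f" 3-bit cache: {q3 / 2**30:.2f} GiB")    # 0.75 GiB
print(f"ratio: {fp16 / q3:.2f}x")               # 5.33x
```

Because the cache grows linearly with sequence length and batch size, the absolute savings are largest for long-context or high-throughput serving.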