lyogavin/airllm
Summary
AirLLM reduces inference memory usage so that large language models can run on consumer-level hardware. Its headline capability is 70B-scale inference on a single 4GB GPU without quantization or pruning, and it supports larger models (e.g., the 405B Llama3.1) when more VRAM is available. The project provides quickstart guides, notebooks, and configuration options, and has a community-driven ecosystem around model compression, configurability, and cross-model support, all under an open-source license.
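
The 4GB figure is easier to believe with some back-of-the-envelope arithmetic. The sketch below is not AirLLM code: the function names are hypothetical, and it assumes 16-bit weights, an 80-layer model (typical of a Llama-2-70B-class architecture), and that only one transformer layer's weights need to sit on the GPU at a time. It also ignores embeddings, activations, and the KV cache, so treat the result as a rough lower bound rather than a measurement.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 10**9 bytes)."""
    return num_params * bytes_per_param / 1e9


def per_layer_memory_gb(num_params: float, num_layers: int,
                        bytes_per_param: int = 2) -> float:
    """Rough per-layer weight footprint, assuming parameters are spread
    evenly across layers (an approximation; embeddings are not separated)."""
    return weight_memory_gb(num_params, bytes_per_param) / num_layers


if __name__ == "__main__":
    # Assumed figures: 70e9 parameters, 80 layers, 2 bytes per parameter.
    full = weight_memory_gb(70e9)
    one_layer = per_layer_memory_gb(70e9, num_layers=80)
    print(f"all weights resident: ~{full:.0f} GB")
    print(f"one layer resident:   ~{one_layer:.2f} GB")
```

Under these assumptions the full set of weights is far larger than any consumer GPU, while a single layer fits comfortably within 4GB, which is why unquantized 70B inference on such a card is feasible in principle at the cost of repeatedly loading layers from disk.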