TurboQuant KV Compression and SSD Expert Streaming for M5 Pro and IOS
Summary
SwiftLM is a native Swift inference server for Apple Silicon that offers an OpenAI-compatible API with SSD-based streaming for large MoE models and advanced KV cache quantization. It introduces TurboQuant KV compression with a hybrid V2/V3 approach to achieve high-quality quantization at near-V2 speeds, enabling on-device AI workloads with reduced memory footprints. The project also provides an iOS app (SwiftLM Chat) and detailed build/run instructions, emphasizing on-device model loading and fast, zero-copy streaming from NVMe storage.