Megakernel for Qwen3.5-0.8B and DFlash for Qwen3.5-27B on RTX 3090: Local LLM Inference Benchmarks
Summary
This GitHub repo showcases two hand-tuned LLM inference projects for the RTX 3090: a megakernel for Qwen3.5-0.8B that achieves high efficiency using a single CUDA dispatch, and a DFlash DDTree port for Qwen3.5-27B that delivers up to 207 tok/s. The repo emphasizes local AI and reproducible benchmarks, is released under the open-source MIT license, and includes setup steps and benchmark results that can be replicated on consumer hardware.
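As a rough illustration of what a tok/s figure like the one above measures, the sketch below times a generation call and divides the token count by the elapsed wall-clock time. It is not taken from the repo: `measure_tok_per_s` and `fake_generate` are hypothetical names, the stand-in generator only sleeps instead of running a model, and reporting the best of several runs is an assumption about how an "up to" number might be obtained.

```python
import time
from typing import Callable, List, Sequence


def measure_tok_per_s(
    generate: Callable[[int], Sequence[int]],
    n_tokens: int,
    runs: int = 3,
) -> float:
    """Return the best observed throughput (tokens per second) over `runs` calls."""
    best = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        tokens = generate(n_tokens)
        elapsed = time.perf_counter() - start
        if elapsed > 0:
            # Throughput = tokens actually produced / wall-clock seconds.
            best = max(best, len(tokens) / elapsed)
    return best


def fake_generate(n: int) -> List[int]:
    # Hypothetical stand-in for a real inference call: sleeps ~1 ms per token
    # and returns n dummy token ids. Replace with an actual model call.
    time.sleep(0.001 * n)
    return list(range(n))


if __name__ == "__main__":
    print(f"{measure_tok_per_s(fake_generate, 50):.1f} tok/s")
```

For a real comparison against the repo's published numbers, the stand-in would be replaced by the project's own generation entry point, and prompt length, output length, and warm-up runs would need to match its benchmark setup.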