Show HN: Tiny-vLLM – high performance LLM inference engine in C++ and CUDA
Summary
The article introduces Tiny-vLLM, a high-performance LLM inference engine written in C++ and CUDA. It walks through loading models from Safetensors files, embedding retrieval, normalization, RoPE, attention, and the KV cache, all implemented with CUDA kernels and cuBLAS-based matrix operations, and it adds practical tips on memory management and batching. It serves as a learning resource for building self-hosted AI inference pipelines on NVIDIA GPUs.
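To make one of the listed steps concrete, here is a minimal CPU reference sketch of RoPE (rotary position embedding) in plain C++. It is not taken from Tiny-vLLM: the function name `rope_inplace`, the interleaved pairing of elements (2i, 2i+1), and the base of 10000 are all assumptions chosen for illustration. Some models pair element i with element i + d/2 instead. A CUDA kernel would typically assign one such pair to each thread.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from Tiny-vLLM): applies RoPE in place to a single
// head vector `x` for the token at position `pos`.
// Assumptions: interleaved pairing (x[2i], x[2i+1]) and frequency base 10000.
void rope_inplace(std::vector<float>& x, std::size_t pos, float base = 10000.0f) {
    const std::size_t d = x.size();
    assert(d % 2 == 0 && "head dimension must be even");

    for (std::size_t i = 0; i < d / 2; ++i) {
        // Pair i rotates with frequency base^(-2i/d).
        const float freq =
            std::pow(base, -2.0f * static_cast<float>(i) / static_cast<float>(d));
        const float angle = static_cast<float>(pos) * freq;
        const float c = std::cos(angle);
        const float s = std::sin(angle);

        // Standard 2D rotation of the pair (a, b) by `angle`.
        const float a = x[2 * i];
        const float b = x[2 * i + 1];
        x[2 * i]     = a * c - b * s;
        x[2 * i + 1] = a * s + b * c;
    }
}
```

Because each pair is only rotated, the vector's length is unchanged and position 0 leaves the input as it was. Both properties make convenient sanity checks when comparing a GPU kernel against a CPU reference like this one.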