Introducing RadixAttention to Trellis
Summary
Trellis introduces RadixAttention to accelerate LLM inference on commodity hardware by caching attention key/value (KV) tensors and reusing them across requests that share a prompt prefix. The design combines a block-paged KV cache with a radix-tree index over cached token prefixes, so shared prefixes are neither recomputed nor stored twice. Benchmarks show a 30-40% speedup along with reduced memory usage. The post covers the design notes and benchmark results, and invites feedback.
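To illustrate how a radix tree enables prefix sharing, here is a minimal Python sketch. It assumes a compressed trie keyed by token IDs, where each edge stores the KV-cache slot index of every token on that edge. `RadixPrefixCache`, `match_prefix`, and `insert` are hypothetical names, not Trellis's API. The sketch tracks per-token slot indices only and leaves out block paging, eviction, and reference counting.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class RadixNode:
    # Token IDs on the edge leading into this node (a compressed path).
    key: tuple[int, ...] = ()
    # KV-cache slot index for each token in `key` (same length as `key`).
    slots: tuple[int, ...] = ()
    # Children indexed by the first token of their edge label.
    children: dict[int, RadixNode] = field(default_factory=dict)


def _common_len(a: tuple[int, ...], b: tuple[int, ...]) -> int:
    """Length of the shared prefix of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class RadixPrefixCache:
    """Hypothetical radix tree mapping token prefixes to cached KV slots."""

    def __init__(self) -> None:
        self.root = RadixNode()

    def match_prefix(self, tokens: list[int]) -> list[int]:
        """Return the KV slots for the longest cached prefix of `tokens`."""
        node, i, slots = self.root, 0, []
        toks = tuple(tokens)
        while i < len(toks) and toks[i] in node.children:
            child = node.children[toks[i]]
            n = _common_len(child.key, toks[i:])
            slots.extend(child.slots[:n])
            i += n
            if n < len(child.key):
                break  # the query diverged partway along this edge
            node = child
        return slots

    def insert(self, tokens: list[int], slots: list[int]) -> None:
        """Record KV slots for `tokens`; slots already cached for a shared prefix are kept."""
        assert len(tokens) == len(slots)
        node, toks, sl = self.root, tuple(tokens), tuple(slots)
        while toks:
            child = node.children.get(toks[0])
            if child is None:
                # No cached continuation: attach the remainder as a new leaf.
                node.children[toks[0]] = RadixNode(toks, sl)
                return
            n = _common_len(child.key, toks)
            if n < len(child.key):
                # Split the edge: a new internal node holds the shared part.
                mid = RadixNode(child.key[:n], child.slots[:n])
                child.key, child.slots = child.key[n:], child.slots[n:]
                mid.children[child.key[0]] = child
                node.children[toks[0]] = mid
                child = mid
            node, toks, sl = child, toks[n:], sl[n:]


cache = RadixPrefixCache()
# Request A: a shared system prompt [1, 2, 3] followed by user tokens [4, 5].
cache.insert([1, 2, 3, 4, 5], [10, 11, 12, 13, 14])
# Request B shares the [1, 2, 3] prefix, so those KV entries can be reused.
print(cache.match_prefix([1, 2, 3, 9]))  # [10, 11, 12]
```

In this sketch only the unmatched suffix of a new request would need fresh KV computation and new slots. When a later request diverges in the middle of an edge, the edge is split so both continuations share a single copy of the common prefix.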