Rotary GPU: Exploring Local Execution Paths for Large Mixture-of-Experts Models Under Limited GPU Memory
Summary
The paper investigates how to make large Mixture-of-Experts (MoE) models more accessible in hardware-constrained environments. Rotary GPU demonstrates a local execution approach that runs on a consumer laptop with an 8 GB VRAM GPU (RTX 4060), generating 2048 output tokens with about 6.3 GB of VRAM in use. This suggests that advanced MoE models could be deployed closer to edge devices. The work is framed as exploratory: its goal is deployment accessibility, not replacing data-center infrastructure.
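
To put the reported figures in perspective, the short sketch below computes the memory headroom implied by the summary: about 1.7 GB free, or roughly 79% of the 8 GB budget in use. The function name `vram_headroom` and its parameters are hypothetical helpers introduced here for illustration; they are not part of Rotary GPU, and only the 8 GB and 6.3 GB values come from the summary.

```python
def vram_headroom(total_gb: float, used_gb: float) -> tuple[float, float]:
    """Return (free GB, fraction of VRAM in use) for a given memory budget.

    Hypothetical helper for illustration only; not part of Rotary GPU.
    """
    if total_gb <= 0 or used_gb < 0:
        raise ValueError("total_gb must be positive and used_gb non-negative")
    return total_gb - used_gb, used_gb / total_gb


# Figures taken from the summary: 8 GB card, about 6.3 GB in use.
free_gb, used_fraction = vram_headroom(total_gb=8.0, used_gb=6.3)
print(f"free: {free_gb:.1f} GB, used fraction: {used_fraction:.3f}")
```

The remaining headroom is what would have to absorb anything not counted in the reported usage, so the margin on an 8 GB card is fairly tight.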