Bringing up DeepSeek-V4-Flash on AMD MI300X
Summary
This worklog documents the effort to run DeepSeek-V4-Flash on AMD's MI300X accelerator. It covers FP8 dialect issues, missing attention fast paths, HIP graphs, and tuning work, and closes with performance observations: a modest tokens-per-second improvement and ongoing portability concerns. The post highlights the gap between software support and the hardware, and notes open-source contributions and potential upstreaming.