Learnings from 4 months of Image-Video VAE experiments
Summary
Linum shares what it learned over four months of building an image-video VAE. A key takeaway is that better reconstruction quality does not always translate into better downstream generation. The post covers the baseline architecture, co-training instability, normalization hacks, and the shift to alternatives such as the Wan 2.1 VAE for embedding efficiency, along with insights on training across resolutions and future directions.