Playing with Vision Embeddings
Summary
This post explores how DINOv3 ViT-S encodes an image as a 384-dimensional embedding vector, and how such an embedding can be inverted back into an image using differentiable optimization combined with augmentation techniques. It then introduces sparse autoencoders (SAEs) to extract thousands of interpretable feature directions from the embedding space, demonstrates feature visualization, interpolation between features, and decomposition, and discusses what these results imply for understanding neural visual representations.
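To make the SAE idea concrete, here is a minimal sketch of the forward pass of a sparse autoencoder over a 384-number embedding. It is an illustration only, not the post's implementation: the dictionary size `NUM_FEATURES`, the ReLU encoder form, and all function names are assumptions, and random weights stand in for a trained model (a real SAE is trained with a sparsity penalty so that only a few features fire per image).

```python
import random

EMBED_DIM = 384      # embedding size of DINOv3 ViT-S, as stated in the post
NUM_FEATURES = 512   # hypothetical dictionary size; the post only says "thousands"


def make_matrix(rows, cols, rng, scale=0.05):
    """Random weights standing in for trained SAE parameters (assumption)."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def sae_encode(x, w_enc, b_enc, b_dec):
    """Feature activations: ReLU(W_enc @ (x - b_dec) + b_enc)."""
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    return [
        max(0.0, sum(w * c for w, c in zip(row, centered)) + b)
        for row, b in zip(w_enc, b_enc)
    ]


def sae_decode(acts, w_dec, b_dec):
    """Reconstruction: b_dec plus a weighted sum of feature directions.

    Each row of w_dec is one feature direction in embedding space.
    """
    out = list(b_dec)
    for act, direction in zip(acts, w_dec):
        if act > 0.0:  # inactive features contribute nothing
            for j, d in enumerate(direction):
                out[j] += act * d
    return out


def top_features(acts, k=5):
    """Indices of the k most strongly activated features."""
    return sorted(range(len(acts)), key=lambda i: acts[i], reverse=True)[:k]


rng = random.Random(0)
w_enc = make_matrix(NUM_FEATURES, EMBED_DIM, rng)
w_dec = make_matrix(NUM_FEATURES, EMBED_DIM, rng)
b_enc = [0.0] * NUM_FEATURES
b_dec = [0.0] * EMBED_DIM

# A random vector standing in for a real image embedding.
embedding = [rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]

features = sae_encode(embedding, w_enc, b_enc, b_dec)
reconstruction = sae_decode(features, w_dec, b_dec)
strongest = top_features(features, k=5)
```

With trained weights, the rows of `w_dec` play the role of feature directions: decomposition amounts to reading off the largest entries of `features`, and interpolation between two features can be sketched as blending the corresponding rows before handing the result to the inversion step.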