TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment
Summary
TIPSv2 advances vision-language pretraining with three targeted improvements: iBOT++, Head-only EMA, and multi-granularity captions. Together, these enhance patch-text alignment. The approach uses distillation to obtain stronger alignment, achieving state-of-the-art zero-shot segmentation and strong performance across dense and global image-text tasks while remaining efficient to train. The work also provides visualization tools and shows that its feature maps capture finer semantic detail.
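To make concrete why patch-text alignment matters for zero-shot segmentation, the sketch below shows the generic recipe that patch-text aligned encoders make possible: each patch embedding is compared against a text embedding per class name, and the patch takes the label of the most similar class. This is an illustrative sketch only, not the TIPSv2 pipeline. The function names `l2_normalize` and `zero_shot_segment`, the toy 2-D embeddings, and the 2x2 patch grid are all assumptions made for the example; a real model would supply much higher-dimensional embeddings from its image and text encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def zero_shot_segment(patch_embeddings, text_embeddings, grid_shape):
    """Label each patch with the index of its most similar text embedding.

    patch_embeddings: list of per-patch vectors, in row-major grid order.
    text_embeddings:  list of per-class vectors (one per class prompt).
    grid_shape:       (rows, cols) of the patch grid.
    Returns a rows x cols nested list of class indices.
    """
    rows, cols = grid_shape
    if len(patch_embeddings) != rows * cols:
        raise ValueError("patch count must equal rows * cols")

    texts = [l2_normalize(t) for t in text_embeddings]
    labels = []
    for patch in patch_embeddings:
        p = l2_normalize(patch)
        # Cosine similarity reduces to a dot product on unit vectors.
        sims = [sum(a * b for a, b in zip(p, t)) for t in texts]
        labels.append(max(range(len(sims)), key=sims.__getitem__))

    # Reshape the flat label list back into the patch grid.
    return [labels[r * cols:(r + 1) * cols] for r in range(rows)]


# Toy data (hypothetical): 4 patches on a 2x2 grid, 2 classes.
patches = [[0.9, 0.1], [0.8, 0.3], [0.1, 1.0], [0.2, 0.7]]
class_texts = [[1.0, 0.0], [0.0, 1.0]]
segmentation = zero_shot_segment(patches, class_texts, (2, 2))
print(segmentation)  # [[0, 0], [1, 1]]
```

Because the labels come directly from per-patch similarities, the quality of the resulting mask depends on how well individual patch embeddings line up with text, which is the property the three improvements above target.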