Do Transformers Need Three Projections? Systematic Study of QKV Variants
Summary
The paper systematically investigates sharing of the query, key, and value projections in Transformer attention, covering three variants (Q-K=V, Q=K-V, Q=K=V), and introduces 2D positional encodings to enable asymmetric attention. It reports substantial memory and cache reductions with minimal accuracy loss in language modeling, especially when combined with head-sharing schemes, and provides open-source code for replication. These properties make the approach valuable for on-device and edge deployment.
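
To make the sharing variants concrete, here is a minimal single-head sketch in plain Python. It is not the authors' released code. It assumes that "=" in the variant names means two roles reuse one projection, and that a standard decoding cache stores one tensor per distinct key/value projection. All function names (`project`, `scores`, `cached_tensors`) are hypothetical, and the 2D positional encodings and head-sharing schemes are not modeled.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(r) for r in zip(*m)]


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def project(x, d_model, d_head, variant, rng):
    """Return (q, k, v) for one head under an assumed sharing variant."""
    w_a = rand_matrix(d_model, d_head, rng)
    w_b = rand_matrix(d_model, d_head, rng)
    w_c = rand_matrix(d_model, d_head, rng)
    if variant == "QKV":    # baseline: three separate projections
        return matmul(x, w_a), matmul(x, w_b), matmul(x, w_c)
    if variant == "Q-K=V":  # keys and values share one projection
        kv = matmul(x, w_b)
        return matmul(x, w_a), kv, kv
    if variant == "Q=K-V":  # queries and keys share one projection
        qk = matmul(x, w_a)
        return qk, qk, matmul(x, w_c)
    if variant == "Q=K=V":  # a single projection plays all three roles
        s = matmul(x, w_a)
        return s, s, s
    raise ValueError(f"unknown variant: {variant}")


def scores(q, k):
    """Unnormalized attention logits q @ k^T (no scaling, mask, or softmax)."""
    return matmul(q, transpose(k))


def is_symmetric(m, tol=1e-9):
    n = len(m)
    return all(abs(m[i][j] - m[j][i]) <= tol for i in range(n) for j in range(n))


def cached_tensors(variant):
    """Distinct per-token tensors to cache, assuming a standard K/V cache."""
    return 1 if variant in ("Q-K=V", "Q=K=V") else 2


if __name__ == "__main__":
    rng = random.Random(0)
    x = rand_matrix(4, 8, rng)  # 4 tokens, model width 8 (arbitrary sizes)
    for variant in ("QKV", "Q-K=V", "Q=K-V", "Q=K=V"):
        q, k, _ = project(x, 8, 4, variant, rng)
        print(variant, "symmetric logits:", is_symmetric(scores(q, k)),
              "cached tensors:", cached_tensors(variant))
```

Under these assumptions, tying Q and K makes the logit matrix symmetric, since it becomes S Sᵀ for the shared projection S. A symmetric logit matrix is presumably what motivates the positional mechanism for restoring asymmetry. Tying K and V is what halves the number of cached tensors in this sketch.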