Which one is more important: more parameters or more computation? (2021)
Summary
The article argues that model size (parameter count) and computation (compute per token) should be treated as separate axes of scaling, and introduces two orthogonal techniques: Hash Layers, which add parameters without increasing per-token compute, and Staircase Attention, which increases compute without adding parameters. Empirical results show that hash-based mixture-of-experts (MoE) routing can yield efficiency and performance gains, while Staircase/Ladder architectures improve performance by spending more compute per parameter; combining the two approaches can yield further improvements.
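A minimal sketch of the two axes, in my own illustrative code (not the article's implementation, and with toy stand-ins for the expert networks): hash-routed MoE sends each token to exactly one expert chosen by a fixed hash of its token ID, so parameters grow with the number of experts while per-token compute stays constant; recurrently reapplying the same block increases compute while the parameter count stays fixed.

```python
# Axis 1: hash-routed mixture-of-experts -- more parameters, same
# per-token compute, since a fixed hash of the token ID picks one expert.

def make_expert(scale):
    # Toy stand-in for an expert feed-forward network.
    return lambda h: [scale * v for v in h]

experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]

def hash_layer(hidden_states, token_ids):
    # Fixed (non-learned) routing: token ID modulo number of experts.
    return [experts[t % len(experts)](h)
            for h, t in zip(hidden_states, token_ids)]

# Axis 2: reapplying the same block recurrently -- more compute per
# token, same parameter count (a loose analogy to Staircase/Ladder).

def recurrent_apply(block, h, steps):
    for _ in range(steps):
        h = block(h)
    return h

out = hash_layer([[1.0, 1.0], [1.0, 1.0]], [5, 6])
print(out)  # token 5 -> expert 1 (x2.0), token 6 -> expert 2 (x3.0)

deep = recurrent_apply(make_expert(2.0), [1.0], 3)
print(deep)  # same parameters applied 3 times: [8.0]
```

Because the hash is fixed rather than learned, adding experts changes only which weights a token uses, not how many operations it costs, which is what lets the two axes be scaled independently.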