Computing sharding with einsum
Summary
The post advocates using einsum notation to reason about how tensor operations are sharded in DTensor. It gives an einsum primer, explains how einsum behaves in the backward pass, and lays out sharding rules. Worked examples covering tensor parallelism and sequence parallelism show how partial gradients arise and propagate in distributed settings.
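As a rough illustration of the ideas summarized above, the sketch below uses plain NumPy rather than DTensor, and simulates two "devices" by splitting arrays with np.split. The shapes, variable names, and index letters (b, i, o) are assumptions chosen for this example, not taken from the post. It shows a matmul written as an einsum, its backward pass written as two more einsums, and what happens when either the contracted index or a free index is sharded.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 6))  # indices: b (batch), i (input features)
w = rng.standard_normal((6, 8))  # indices: i (input features), o (output features)

# Forward: i appears in both inputs but not the output, so it is summed over.
y = np.einsum("bi,io->bo", x, w)

# Backward: each input gradient is an einsum of grad_y with the other input,
# obtained by rearranging the same index strings.
grad_y = rng.standard_normal(y.shape)
grad_x = np.einsum("bo,io->bi", grad_y, w)
grad_w = np.einsum("bi,bo->io", x, grad_y)

# Case 1: shard the contracted index i across two simulated devices.
# Each device computes a full-shaped but partial result; summing the
# partials (an all-reduce in a real system) recovers y.
x_i_shards = np.split(x, 2, axis=1)
w_i_shards = np.split(w, 2, axis=0)
y_partials = [np.einsum("bi,io->bo", xs, ws)
              for xs, ws in zip(x_i_shards, w_i_shards)]
y_from_partials = sum(y_partials)

# Case 2: shard the free index b. Each device produces its own slice of
# the output, so no reduction is needed in the forward pass.
x_b_shards = np.split(x, 2, axis=0)
y_from_batch_shards = np.concatenate(
    [np.einsum("bi,io->bo", xs, w) for xs in x_b_shards], axis=0)

# In the backward pass of case 2, grad_w contracts over b, which is the
# sharded index, so each device holds only a partial gradient for w.
grad_y_b_shards = np.split(grad_y, 2, axis=0)
grad_w_partials = [np.einsum("bi,bo->io", xs, gs)
                   for xs, gs in zip(x_b_shards, grad_y_b_shards)]
grad_w_from_partials = sum(grad_w_partials)
```

The pattern to notice is that sharding an index that survives into the output keeps the result sharded on that index, while sharding an index that is summed away leaves a partial result awaiting a reduction. The same reasoning applied to the backward einsums explains why a weight that is replicated in the forward pass can end up with a partial gradient.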