Softmax, can you derive the Jacobian? And should you care?
Summary
A thorough walkthrough of softmax: how it turns a vector of scores into a probability distribution, and the structure of its Jacobian, diag(s) - ss^T. It covers numerical stability via the max-subtraction trick, the role of the axis argument in batched computation, the backward pass, and why pairing softmax with cross-entropy collapses the gradient to s - y, with practical Python code and insights for efficient neural network training.
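As a minimal sketch of the ideas the article covers, the snippet below (an illustrative implementation, not the article's exact code) builds a numerically stable softmax, forms the Jacobian diag(s) - ss^T for a single vector, checks it against finite differences, and verifies that combining softmax with cross-entropy reduces the gradient to s - y:

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the max before exponentiating: softmax is shift-invariant,
    # and this prevents overflow for large scores.
    z = z - np.max(z, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)

def softmax_jacobian(s):
    # Jacobian of softmax at output s (single vector): diag(s) - s s^T
    return np.diag(s) - np.outer(s, s)

z = np.array([1.0, 2.0, 3.0])
s = softmax(z)
J = softmax_jacobian(s)

# Sanity check: column j of J is ds/dz_j, approximated by central differences.
eps = 1e-6
I = np.eye(3)
J_num = np.stack(
    [(softmax(z + eps * I[j]) - softmax(z - eps * I[j])) / (2 * eps) for j in range(3)],
    axis=1,
)
assert np.allclose(J, J_num, atol=1e-6)

# With cross-entropy loss L = -sum(y * log(s)) and one-hot target y,
# dL/ds = -y/s, and the chain rule dL/dz = J^T (dL/ds) (J is symmetric)
# collapses to s - y, so the full Jacobian is never materialized in practice.
y = np.array([0.0, 0.0, 1.0])
grad = J @ (-y / s)
assert np.allclose(grad, s - y)
```

The final assertion is the key practical point: frameworks fuse softmax and cross-entropy precisely because the combined gradient is the cheap, stable expression s - y rather than a full Jacobian-vector product.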