What's even going on with norm layers?

Open Question 3.14: What's even going on with norm layers? What scaling relationships apply to norm layers embedded within deep neural networks? We’re interested here in both hyperparameter scaling prescriptions like \(\mu\)P and empirical scaling relationships that relate, say, the number or strength of norm layers to statistics of model weights, representations, or performance.
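
As one concrete way to read "empirical scaling relationships", the following is a minimal sketch, not a method proposed by the question itself. It tracks the root-mean-square (RMS) of hidden representations through a random ReLU MLP, with and without a norm layer after each block. Everything in it is an illustrative assumption: the helper names `rms`, `rms_normalize`, and `forward_rms`, the initialization \(W_{ij} \sim \mathcal{N}(0, 1/\text{width})\), the choice of ReLU, and the use of a parameter-free RMS normalization as a stand-in for a norm layer.

```python
import math
import random


def rms(v):
    """Root-mean-square of a vector."""
    return math.sqrt(sum(x * x for x in v) / len(v))


def rms_normalize(v, eps=1e-12):
    """Parameter-free RMS normalization (assumed stand-in for a norm layer)."""
    scale = rms(v) + eps
    return [x / scale for x in v]


def forward_rms(width, depth, use_norm, seed=0):
    """Return the RMS of the hidden vector after each layer of a random ReLU MLP.

    The RMS is recorded before the optional normalization step, so the
    returned list shows what each norm layer receives as input.
    """
    rng = random.Random(seed)
    h = [rng.gauss(0.0, 1.0) for _ in range(width)]
    std = 1.0 / math.sqrt(width)  # assumed init: W_ij ~ N(0, 1/width)
    history = []
    for _ in range(depth):
        # Fresh random weight row for every output unit of this layer.
        pre = [sum(rng.gauss(0.0, std) * x for x in h) for _ in range(width)]
        h = [max(0.0, p) for p in pre]  # ReLU
        history.append(rms(h))
        if use_norm:
            h = rms_normalize(h)
    return history


if __name__ == "__main__":
    plain = forward_rms(width=128, depth=12, use_norm=False)
    normed = forward_rms(width=128, depth=12, use_norm=True)
    for layer, (a, b) in enumerate(zip(plain, normed), start=1):
        print(f"layer {layer:2d}  no norm: {a:.4f}  with norm: {b:.4f}")
```

Under this assumed initialization, ReLU halves the second moment of the representation in expectation, so the RMS without normalization decays roughly geometrically with depth, while the normalized network resets it at every layer. Repeating the measurement across widths, depths, or norm placements is one way to collect the kind of statistic the question asks about; it says nothing by itself about trained networks or about \(\mu\)P-style prescriptions.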


This is a discussion page for the open question above. Feel free to share ideas, approaches, or relevant research in the comments below.

Discussion