Open Question 3.11
Hyperparameter scaling for MoEs.
What hyperparameter scaling prescriptions apply to mixtures of experts? Can the central arguments of \(\mu\)P be imported and used to obtain initialization scales and learning rates that give rich training at infinite width?
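For concreteness, one natural starting point is to apply the standard \(\mu\)P prescription expert-by-expert: hidden weights initialized with variance \(1/\text{fan-in}\) and per-parameter Adam learning rates scaled like \(1/\text{width}\). The sketch below illustrates only that baseline; the function name and the per-expert treatment are assumptions for illustration, and it deliberately omits the router, whose interaction with the width limit is precisely what the question leaves open.

```python
import numpy as np

def mup_style_expert_params(d_in, d_hidden, n_experts, base_lr=1e-3, seed=0):
    """Sketch: standard muP-style scaling applied independently to each
    expert MLP. This is a baseline to reason about, not a settled
    prescription for MoEs (routing is intentionally ignored here)."""
    rng = np.random.default_rng(seed)
    experts = []
    for _ in range(n_experts):
        # Hidden ("matrix-like") weights: init variance 1/fan_in,
        # so entries have standard deviation fan_in**-0.5.
        W_in = rng.normal(0.0, d_in ** -0.5, size=(d_hidden, d_in))
        W_out = rng.normal(0.0, d_hidden ** -0.5, size=(d_in, d_hidden))
        experts.append({
            "W_in": W_in,
            "W_out": W_out,
            # Under muP with Adam, hidden-layer learning rates are
            # scaled like 1/width so that feature updates stay O(1).
            "lr_in": base_lr / d_hidden,
            "lr_out": base_lr / d_hidden,
        })
    return experts

experts = mup_style_expert_params(d_in=64, d_hidden=256, n_experts=8)
```

Whether this per-expert scaling is actually the right limit — e.g. whether the router logits need their own multiplier, or whether the number of experts should scale jointly with width — is part of the open question.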
This is a discussion page for the open question above. Feel free to share ideas, approaches, or relevant research in the comments below.
Discussion