Open Question 3.1: Optimal hyperparameters for a simple nonlinear model

In a simple but nontrivial model (say, a linear network of infinite width but finite depth, trained with population gradient descent), what are the optimal choices for the layerwise initialization scales and learning rates, not just the width scalings but also the constant prefactors? Are they the same or different between layers? Do empirics reveal discernible patterns that theory might aim to explain?
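
On the empirical side, one cheap starting point: for a deep linear network with whitened Gaussian inputs, the population loss collapses to a Frobenius norm of the end-to-end map, so population gradient descent can be simulated exactly at finite width. The Python/NumPy sketch below sweeps shared init and learning-rate prefactors under one assumed parameterization (1/sqrt(fan-in) init with prefactor sigma_l, learning rates scaled as eta_l/width); the widths, depth, grids, scalings, and function names are illustrative assumptions, not claims about what is optimal.

import numpy as np

rng = np.random.default_rng(0)

def init_net(widths, init_prefactors):
    # Layer l maps widths[l] -> widths[l+1]; entries ~ sigma_l / sqrt(fan_in).
    return [s * rng.standard_normal((widths[l + 1], widths[l])) / np.sqrt(widths[l])
            for l, s in enumerate(init_prefactors)]

def population_loss_and_grads(Ws, T):
    # For x ~ N(0, I) and target y = Tx, the population MSE of a linear
    # network equals ||W_L ... W_1 - T||_F^2, so population gradients are exact.
    L = len(Ws)
    pre = [np.eye(Ws[0].shape[1])]            # pre[l] = W_l ... W_1 (pre[0] = I)
    for W in Ws:
        pre.append(W @ pre[-1])
    suf = [np.eye(Ws[-1].shape[0])]           # built back to front
    for W in reversed(Ws):
        suf.append(suf[-1] @ W)
    suf = suf[::-1]                           # suf[l] = W_L ... W_{l+1} (suf[L] = I)
    R = pre[-1] - T                           # residual of the end-to-end map
    grads = [2.0 * suf[l + 1].T @ R @ pre[l].T for l in range(L)]
    return float(np.sum(R ** 2)), grads

def train(widths, init_prefactors, lr_prefactors, T, steps=1000):
    Ws = init_net(widths, init_prefactors)
    n = max(widths)                           # hidden width
    for _ in range(steps):
        loss, grads = population_loss_and_grads(Ws, T)
        if not np.isfinite(loss):             # diverged: stop early
            return loss
        for l, G in enumerate(grads):
            Ws[l] -= (lr_prefactors[l] / n) * G   # lr_l = eta_l / n (one candidate scaling)
    return population_loss_and_grads(Ws, T)[0]

if __name__ == "__main__":
    n_io, width, depth = 8, 256, 3            # finite-width proxy for the infinite-width limit
    widths = [n_io] + [width] * (depth - 1) + [n_io]
    T = rng.standard_normal((n_io, n_io))     # random linear target map
    for sigma in (0.5, 1.0, 2.0):             # init prefactor, shared across layers here
        for eta in (0.02, 0.05, 0.1):         # lr prefactor, shared across layers here
            loss = train(widths, [sigma] * depth, [eta] * depth, T)
            print(f"sigma={sigma:.1f}  eta={eta:.2f}  final population loss={loss:.3e}")

Passing per-layer lists for init_prefactors and lr_prefactors turns this into a direct probe of the layerwise question; since a full grid grows exponentially with depth, a coordinate-wise search over one layer's prefactors at a time is a natural next step.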

This is a discussion page for the open question above. Feel free to share ideas, approaches, or relevant research in the comments below.

Discussion