Quickstart Guide: Hyperparameter selection (and why theorists should care)

Training a neural network is a strange and mysterious process: a bewildering cavalcade of tensors is randomly initialized and subjected to repeated gradient updates, and as a consequence of these simple operations, the whole thing learns. Meanwhile, before the process begins, you have to set your learning rate and some other fiddly knobs and dials. In comparison with the weights themselves, these hyperparameters might seem drab, technical and mundane; does a serious deep learning theorist really need to bother with them? It turns out the answer is emphatically “yes”: not only is the study of hyperparameters the easiest place for theory to make a practical impact, it’s also essential for the rest of the field, since any hyperparameter you can’t control for will interfere with your attempts to study anything else.

While most hyperparameters are still waiting for good theory, we understand a few. We’ll explain how theorists currently think about hyperparameters, list the successes so far, and point out some frontiers. This chapter will be comparatively long, so we’ll start with a table of contents.

Classes of hyperparameter: optimization, architecture, and data
How to deal with hyperparameters as a theorist
Recurring motif: hyperparameter scaling relationships
Width, initialization scale, and learning rate
“Wider is better”
Depth
Batch size
Transformer-specific hyperparameters
Activation function
New frontiers

Classes of hyperparameter: optimization, architecture, and data

By hyperparameters, we mean all the numbers and choices that define our training process. Unlike the network parameters (the tensor weights optimized during training), hyperparameters are chosen at the start of training and don’t change. Together, the hyperparameters specify the optimization process, the architecture, and the preprocessing of the dataset. The choice of hyperparameters can make a big difference in model performance, and unless you have a smart way to choose them, it’s not uncommon to spend 10-100x the effort and compute of the final training run just optimizing hyperparameters. A little math can make this search much more methodical, and thus the science of hyperparameters is far and away the most practically impactful area of deep learning theory in 2025.

Even a simple neural network training procedure involves a dizzying array of hyperparameters. We will narrow in on only a few key hyperparameters for analytical study, but here at the outset, it’s worth enumerating a full list to get a sense of scope.

Hyperparameters can be grouped into three categories. First, optimization hyperparameters dictate how network parameters are initialized and how they respond to gradients.

The optimizer — SGD, Adam, Muon, or similar — fixes the functional form by which parameter gradients are turned into parameter updates.
The optimizer will have one or more hyperparameters (learning rate \(\eta\), momentum \(\mu\), weight decay \(\text{wd}\), tolerance \(\epsilon\), etc.) that enter as constants in this update rule. These hyperparameters affect the dynamics of optimization.
The batch size \(B\) and total step count \(T\) specify the amount of data in each gradient batch and the total number of batches to process.
The initialization scale \(\sigma\) for each layer specifies the size of parameters at init.
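To make the list above concrete, here is a minimal sketch of where each optimization hyperparameter enters a training loop. The toy problem (fitting the slope of \(y = 2x\) with a single scalar weight), the function names, and the particular momentum and weight-decay convention are assumptions chosen for illustration; real optimizers differ in such details.

```python
import random

def sgd_momentum_step(w, grad, velocity, lr, momentum, weight_decay):
    """One SGD-with-momentum update for a scalar weight (one common convention)."""
    grad = grad + weight_decay * w          # weight decay folded into the gradient
    velocity = momentum * velocity + grad   # momentum buffer accumulates gradients
    return w - lr * velocity, velocity      # learning rate scales the final update

def train(data, lr=0.1, momentum=0.0, weight_decay=0.0,
          batch_size=4, steps=500, init_scale=1.0, seed=0):
    """Fit y = w * x by minibatch SGD; every keyword argument is a hyperparameter."""
    rng = random.Random(seed)
    w = rng.gauss(0.0, init_scale)          # init scale sigma sets the size of w at init
    velocity = 0.0
    for _ in range(steps):                  # total step count T
        batch = rng.sample(data, batch_size)  # batch size B
        # Gradient of the mean squared error 0.5 * (w*x - y)**2 over the batch.
        grad = sum((w * x - y) * x for x, y in batch) / batch_size
        w, velocity = sgd_momentum_step(w, grad, velocity, lr, momentum, weight_decay)
    return w

# Hypothetical noiseless dataset with true slope 2.
data = [(k / 10.0, 2.0 * k / 10.0) for k in range(1, 11)]
w_final = train(data)
print(w_final)
```

Swapping in a different optimizer would change only the body of `sgd_momentum_step`, while the learning rate, batch size, step count, and init scale would keep their roles in the loop.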

Optimization hyperparameters tend to be the most quantitative: many are real-valued numbers instead of choices from a discrete set. As a result, they are the most amenable to theoretical analysis and will receive most of our attention in this chapter.¹

Architectural hyperparameters dictate the structure of the network’s forward computation. These include the architecture type (MLP, CNN, transformer, etc.), the number of layers and their widths, the presence and location of norm layers, the choice of nonlinear activation function, the floating-point precision, and any other quirks of the network structure. The space of possible architectures is large and ill-defined, and it remains difficult to search over methodically. Of these hyperparameters, we will mostly discuss only network width, depth, and activation function, as these are the ones we currently know how to study.

Lastly, data hyperparameters include the choice of dataset, any cleaning, curation or tokenization procedures, any choice of curriculum, and any fine-tuning procedure. Though interesting and crucial, these choices are beyond the reach of current theory, and we will mostly omit them from the rest of the chapter.

How to deal with hyperparameters as a theorist

A practitioner concerned only with model performance can optimize hyperparameters numerically and forget about them. A theorist ought to be concerned with more than model performance, though, and should try to predict quantities that will be affected by these hyperparameters. For example, the loss (or sharpness, or feature change, etc.) after \(T\) steps will usually depend on the learning rate \(\eta\) (and a whole lot else), so any quantitative prediction of the type we would like to make will depend explicitly on \(\eta\) and other hyperparameters. At first, this might seem to spell doom for our hopes for simple theory. What can we do?

There are two main answers. The first and easiest is to simply remove any hyperparameter you can. If you can do your study without a hyperparameter (momentum, say, or norm layers), do it; you can always add it back later, and your science will be clearer with fewer bells and whistles in the way. If your optimization process has a unique minimizer that is reached no matter the learning rate, make sure you train for long enough to reach it. Doing this, you can usually reduce the problem to a few optimization hyperparameters (e.g. learning rate(s), batch size, and step count) and a few architectural hyperparameters (e.g. width, depth, activation function, init scale). The second answer, exploiting scaling relationships between hyperparameters, is the subject of the next section.

Recurring motif: hyperparameter scaling relationships

After removing all the hyperparameters you can, you should look for scaling relationships between your hyperparameters that let you reduce the effective number. For example, if you map \((\eta, T) \mapsto (\frac{1}{2} \eta, 2 T)\) — that is, you halve the learning rate and double the step count — then you approximately get the same training dynamics, so long as \(\eta\) was small enough to begin with. This “gradient flow” limit is very useful: unless finite stepsize effects are important for your study, you should work in it. In this limit, we only need to care about the “effective training time” \(\tau := \eta \cdot T\) and can forget \(\eta\) and \(T\) as independent quantities, so we’ve effectively reduced our number of hyperparameters by one.
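As a worked illustration of this limit, consider a toy case that is an assumption of this example rather than part of the discussion above: gradient descent on the one-dimensional quadratic loss \(L(w) = \frac{\lambda}{2} w^2\) with curvature \(\lambda > 0\), started from \(w_0\). Each step multiplies \(w\) by \(1 - \eta\lambda\), so after \(T\) steps with \(\tau = \eta \cdot T\) held fixed,

\[
w_T = (1 - \eta\lambda)^T \, w_0 = \left(1 - \frac{\lambda\tau}{T}\right)^{T} w_0 \;\longrightarrow\; e^{-\lambda\tau} \, w_0 \quad \text{as } T \to \infty .
\]

Halving \(\eta\) and doubling \(T\) leaves \(\tau\) unchanged, so both runs approximate the same limit \(e^{-\lambda\tau} w_0\), and the discrepancy between them shrinks as \(\eta\lambda \to 0\). For instance, with \(\lambda\tau = 1\), taking \(T = 100\) gives \(0.99^{100} \approx 0.366\) and \(T = 200\) gives \(0.995^{200} \approx 0.367\), both close to \(e^{-1} \approx 0.368\).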

A similar reduction applies at large width. In the previous chapter, we discussed how for a middle layer of a moderately wide MLP, the parameters should be initialized with scale \(\sigma \sim \frac{1}{\sqrt{\text{width}}}\), so if you quadruple the width, you should halve the init scale. If you also adjust the learning rate in accordance with \(\mu\)P, you can work in the large-width limit, and width vanishes as a hyperparameter. Unless finite width is important for your study, you should probably work mathematically in the large-width limit (and of course compare empirically to finite-width nets).
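The effect of the \(\frac{1}{\sqrt{\text{width}}}\) init scale can be checked numerically. The sketch below is a minimal illustration under stated assumptions: a single square Gaussian weight matrix, a random input with order-one entries, and an order-one prefactor set to 1. The function names are hypothetical. It compares the size of the layer's output to the size of its input when \(\sigma\) is rescaled with width and when it is held fixed.

```python
import math
import random

def init_scale(width):
    """Init standard deviation 1/sqrt(width) for a hidden layer (prefactor assumed to be 1)."""
    return 1.0 / math.sqrt(width)

def rms(values):
    """Root-mean-square size of a list of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))

def preactivation_gain(width, sigma, seed=0):
    """Ratio rms(W x) / rms(x) for a width-by-width Gaussian W with entry std sigma."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    h = [sum(rng.gauss(0.0, sigma) * xj for xj in x) for _ in range(width)]
    return rms(h) / rms(x)

widths = (64, 256)
# Sigma rescaled with width: the gain should stay near 1 at every width.
gain_scaled = {n: preactivation_gain(n, init_scale(n)) for n in widths}
# Sigma frozen at its width-64 value: the gain should grow like sqrt(width / 64).
gain_fixed = {n: preactivation_gain(n, init_scale(64)) for n in widths}
print(gain_scaled, gain_fixed)
```

With the rescaled \(\sigma\), the output stays the same order of magnitude as the input at both widths, whereas with the frozen \(\sigma\) it roughly doubles when the width is quadrupled.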

Since the experiments you run to match your theory won’t take place in an infinite or infinitesimal limit, we still have the question of how small is small enough (and ditto for “large”). Fortunately, numerical evidence suffices for this: divide \(\eta\) by another factor of 10, or multiply width by another factor of 10, and if the change in dynamics is negligible, you’re close enough to the limit.²
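The check described above can be sketched as follows. This is a minimal example under stated assumptions: the toy loss \(L(w) = \frac{1}{4}(w^2 - 1)^2\), the starting point, the tolerance, and the function names are all hypothetical choices made for illustration. The idea is to hold the effective training time \(\tau = \eta \cdot T\) fixed, divide \(\eta\) by 10, and see whether the outcome moves.

```python
def run_gd(lr, total_time, w0=0.5):
    """Gradient descent on the toy loss L(w) = (w**2 - 1)**2 / 4 for total_time / lr steps."""
    steps = round(total_time / lr)
    w = w0
    for _ in range(steps):
        w -= lr * w * (w * w - 1.0)   # dL/dw = w * (w**2 - 1)
    return w

def near_gradient_flow(lr, total_time=2.0, tol=1e-2):
    """True if a tenfold smaller lr at the same tau = lr * steps barely changes the result."""
    coarse = run_gd(lr, total_time)
    fine = run_gd(lr / 10.0, total_time)
    return abs(coarse - fine) < tol

for lr in (1.0, 0.01):
    print(lr, near_gradient_flow(lr))
```

In this toy problem, a learning rate of 1.0 overshoots the minimum at \(w = 1\) and fails the check, while 0.01 passes. The same pattern applies to width: multiply the width by 10 at fixed values of the other hyperparameters and compare the quantity you care about.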

Almost all of our useful understanding of hyperparameters takes the form of hyperparameter scaling relationships.³ There are probably several important scaling relationships that remain to be worked out. After using all known relationships, you are usually still left with a handful of hyperparameters. The number of remaining hyperparameters often determines the difficulty of the calculation you have to do, so it’s a good idea to get it as low as possible.⁴

Without further ado, here are the hyperparameters we have theory for.

Width, initialization scale, and learning rate

This was mostly covered in the previous section. There’s essentially only one way to scale the layerwise init sizes and learning rates with model width such that you retain feature learning at large width. This scaling scheme is called the maximal-update parameterization, or \(\mu\)P.

The original paper here is [Yang and Hu (2021)]. Most people find that [Yang et al. (2023)] gives a simpler exposition. The core idea here is essential; this is the only hyperparameter scaling relationship in this section that’s mandatory for doing or reading most modern deep learning theory.
[Yang et al. (2022)]’s followup “\(\mu\)Transfer” paper showed that getting the scaling relationships here right can let you optimize your hyperparameters on a small model and scale them up to a large model, much like how civil engineers build scaled-down models to test the mechanics of proposed designs. This paper basically launched the modern study of hyperparameter scaling and is one of very few practically influential theory papers to date.

It’s worth noting that there are order-one constant prefactors at each layer that \(\mu\)P leaves undetermined. For example, \(\mu\)P tells us that the init scale for an intermediate layer should be such that \(\sigma_\text{eff} := \sigma \cdot \sqrt{\text{width}}\) is an order-one, width-independent quantity, but it doesn’t tell us what the actual value should be. There is currently no theory that tells us what these prefactors should be.

Open Question 3.1: Optimal hyperparameters for a simple nonlinear model. In a simple but nontrivial model (say, a linear network of infinite width but finite depth, trained with population gradient descent), what are the optimal choices for the layerwise init scales and learning rates, including not just the width scalings but also the constant prefactors? Are they the same or different between layers? Do empirics reveal discernible patterns that theory might aim to explain?
