These open questions are drawn from our position paper.
Open Question 2.1: Convergence of wide \(\mu\)P networks. Under what conditions does a network in the infinite-width \(\mu\)P limit converge when optimized with gradient descent?
Open Question 2.2: Framework for studying feature learning at large width. Is there a simple, computationally tractable calculational framework — potentially making realistic simplifying assumptions — that allows us to quantitatively study the feature evolution of a general class of neural networks in the rich regime while tracking less information than the DMFT framework of [Bordelon and Pehlevan (2022)]?
Open Question 2.3: Framework for studying feature learning at large width and depth. Is there a simple, computationally tractable calculational framework — potentially making realistic simplifying assumptions — that allows us to quantitatively study the feature evolution of an infinite-depth network in the rich regime?
Open Question 3.1: Optimal hyperparameters for a simple nonlinear model. In a simple but nontrivial model — say, a linear network of infinite width but finite depth, trained with population gradient descent — what are the optimal choices of layerwise initialization scales and learning rates, not just the width scalings but also the constant prefactors? Are they the same or different between layers? Do empirics reveal discernible patterns that theory might aim to explain?
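One way to start poking at this question empirically is to simulate the proposed setup at large but finite width, where the population loss of a deep linear network on Gaussian inputs is available in closed form, so no sampling is needed. The sketch below is illustrative only: it uses depth 2, and the width, step count, and learning-rate grid are arbitrary choices, not recommendations.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, T = 16, 64, 200            # input dim, width, GD steps (toy sizes)
beta = rng.standard_normal(d)
beta /= np.linalg.norm(beta)     # teacher: y = beta . x

def train(eta1, eta2, sigma1=1.0, sigma2=1.0):
    """Population GD on a depth-2 linear net f(x) = w2 . (W1 x).

    For x ~ N(0, I), the population loss is exactly 0.5 ||W1^T w2 - beta||^2,
    so the population gradient is available in closed form. sigma1, sigma2
    are the layerwise init-scale prefactors the question asks about.
    """
    W1 = sigma1 * rng.standard_normal((m, d)) / np.sqrt(d)
    w2 = sigma2 * rng.standard_normal(m) / np.sqrt(m)
    for _ in range(T):
        r = W1.T @ w2 - beta     # function-space residual
        # simultaneous update of both layers with their own learning rates
        W1, w2 = W1 - eta1 * np.outer(w2, r), w2 - eta2 * (W1 @ r)
    return 0.5 * np.sum((W1.T @ w2 - beta) ** 2)

for eta1 in (0.02, 0.1, 0.3):
    for eta2 in (0.005, 0.02, 0.05):
        print(f"eta1={eta1:5.3f} eta2={eta2:5.3f}  loss={train(eta1, eta2):.3e}")
```

Scanning such grids over \((\sigma_\ell, \eta_\ell)\) at several widths and depths is one cheap way to look for the layerwise patterns the question is after.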
Open Question 3.2: Scaling relationships for learning rate schedules. What scaling rules or relationships apply to learning rate schedules? What nondimensionalized quantities emerge? Can we “post-dict” properties of common learning rate schedules used in practice?
Open Question 3.3: Is richer better? Research by [Atanasov et al. (2024)] finds that, in online training, networks with larger richness parameter \(\gamma\) generalize better (so long as they’re given enough training time to escape the initial plateau). Is this generally true? Why?
Open Question 3.4: Is wider better? Can it be shown that, when all hyperparameters are optimally tuned, a wider MLP performs better on average on arbitrary tasks (perhaps under some reasonable assumptions on task structure)?
Open Question 3.5: "Wider is better" counterexample. Is there a nontrivial example of a task for which a wider network does not perform better, even when all other hyperparameters are optimally tuned?
Open Question 3.6: Is deeper better? Can it be shown that, when all hyperparameters are optimally tuned, a deeper MLP performs better on average on arbitrary tasks (perhaps under some reasonable assumptions on task structure)?
Open Question 3.7: "Deeper is better" counterexample. Is there a nontrivial example of a task for which a deeper network does not perform better, even when all other hyperparameters are optimally tuned?
Open Question 3.8: Explaining compute-optimal batch sizes. Why does the batch size prescription of [McCandlish et al. (2018)], derived under an assumption of an isotropic quadratic loss, nonetheless predict the compute-optimal batch size on a variety of realistic tasks?
Open Question 3.9: Why tokens $\propto$ parameters in LLMs? Why is the compute-optimal prescription for LLMs a fixed number of tokens per parameter? A good place to start may be a study of random feature regression, in which the eigenframework of e.g. [Simon et al. (2024)] will correctly predict that the number of parameters and number of samples should scale proportionally for compute-optimal performance. Can a more general argument be extracted from consideration of this simple model? The correctness of a proposed explanation should be confirmed by making some new prediction that can be tested with transformers, such as how changing the task difficulty affects the optimal tokens-to-parameters ratio.
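For readers who want to experiment, here is a minimal numerical stand-in for that starting point. Everything in it (the single-index task, the ReLU random features, the ridge level, and the crude compute proxy of holding n·p fixed) is an illustrative assumption rather than the eigenframework calculation itself:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32  # input dimension

def rf_test_mse(n, p, n_test=2000, ridge=1e-3):
    """Ridge regression with p random ReLU features on n samples of a
    synthetic single-index task y = tanh(w* . x), x ~ N(0, I)."""
    W = rng.standard_normal((d, p)) / np.sqrt(d)   # frozen random features
    w_star = rng.standard_normal(d) / np.sqrt(d)   # teacher direction
    phi = lambda X: np.maximum(X @ W, 0.0)
    X, Xt = rng.standard_normal((n, d)), rng.standard_normal((n_test, d))
    y, yt = np.tanh(X @ w_star), np.tanh(Xt @ w_star)
    F = phi(X)
    a = np.linalg.solve(F.T @ F + ridge * np.eye(p), F.T @ y)
    return float(np.mean((phi(Xt) @ a - yt) ** 2))

budget = 2 ** 14  # crude compute proxy: hold n * p fixed
for k in range(-3, 4):
    p = max(4, int(np.sqrt(budget * 2.0 ** k)))
    n = max(4, budget // p)
    print(f"p={p:4d} n={n:5d} n/p={n / p:7.2f}  test MSE={rf_test_mse(n, p):.4f}")
```

Sweeping the samples-to-parameters ratio at fixed budget like this is where one would look for the proportional-scaling optimum, and where the analogy to tokens-per-parameter in LLMs could be stress-tested, e.g. by varying the task difficulty.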
Open Question 3.10: Why depth $\propto$ width in LLMs? Why, judging by publicly reported LLM architecture specs, does it seem optimal to scale transformer depth proportionally to width?
Open Question 3.11: Hyperparameter scaling for MoEs. What hyperparameter scaling prescriptions apply to mixtures of experts? Can the central arguments of \(\mu\)P be imported and used to obtain initialization scales and learning rates that give rich training at infinite width?
Open Question 3.12: Why ReLU? Why is ReLU close to the optimal activation function for most deep learning applications? A scientific answer to this question should include calculations and convincing experiments that make the case.
Open Question 3.13: Why gated activation functions? Modern transformer architectures often use gated activation functions like SwiGLU instead of ordinary pointwise nonlinearities like ReLU. SwiGLU in particular is puzzling compared to the original GLU activation function because it diverges quadratically as a layer input \(\mathbf{h}_\ell\) grows in norm, as opposed to ReLU and most of its common variants, which diverge linearly. As [Shazeer (2020)] says after proposing SwiGLU:
We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence.
So: whence the advantage of gated activation functions in large transformer models?
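The quadratic growth mentioned above is easy to verify numerically. In the sketch below, the weight matrices are random stand-ins for the learned up-projection and gate matrices (an assumption made purely for illustration), and the scalar s plays the role of the input norm:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
W = rng.standard_normal((d, d)) / np.sqrt(d)  # stand-in "value" projection
V = rng.standard_normal((d, d)) / np.sqrt(d)  # stand-in gate projection

def swish(x):
    # x * sigmoid(x), a.k.a. SiLU, in a numerically stable tanh form
    return x * 0.5 * (1.0 + np.tanh(0.5 * x))

def swiglu(h):
    # gated activation: an elementwise product of two linear maps of h
    return swish(h @ W) * (h @ V)

def relu(h):
    return np.maximum(h, 0.0)

h = rng.standard_normal(d)
for s in (1.0, 10.0, 100.0):
    print(f"s={s:6.1f}  |ReLU(s h)|={np.linalg.norm(relu(s * h)):10.2f}"
          f"  |SwiGLU(s h)|={np.linalg.norm(swiglu(s * h)):14.2f}")
```

ReLU's output norm grows linearly in s (it is positively homogeneous), while SwiGLU's grows quadratically: swish(x) approaches x for large positive x, and the gate contributes a second factor linear in h.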
Open Question 3.14: What's even going on with norm layers? What scaling relationships apply to norm layers embedded within deep neural networks? We’re interested here in both hyperparameter scaling prescriptions like \(\mu\)P and empirical scaling relationships which relate, say, the number or strength of norm layers to statistics of model weights, representations, or performance.
Open Question 3.15: Do we really need norm layers? There is a feeling among practitioners and theorists alike that norm layers are somewhat unnatural. Can their effect on forward-propagation and training be characterized well enough that they can be replaced by something more mathematically elegant? Even if this does not yield better performance, it would be a step towards an interpretable science of large models.
Open Question 3.16: What's even going on with Adam? What scaling relationships apply to Adam’s \(\beta\) or \(\epsilon\) hyperparameters?
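To make the \(\epsilon\) part of the question concrete: for a constant scalar gradient g, Adam's bias-corrected update reduces to g/(|g| + ε), so updates are scale-invariant until gradients approach the ε scale, at which point width- or schedule-dependent gradient magnitudes start to interact with ε. A minimal sketch using the common default hyperparameter values:

```python
import numpy as np

def adam_update(g, beta1=0.9, beta2=0.999, eps=1e-8, steps=200):
    """Adam update magnitude for a constant scalar gradient g (lr = 1)."""
    m = v = 0.0
    for _ in range(steps):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
    mhat = m / (1 - beta1 ** steps)  # bias correction
    vhat = v / (1 - beta2 ** steps)
    return mhat / (np.sqrt(vhat) + eps)

for g in (1e-12, 1e-8, 1e-4, 1.0):
    print(f"g={g:g}  update={adam_update(g):.3e}")
```

Updates are near 1 whenever |g| is well above ε but shrink once |g| falls to the ε scale; whether and how ε should scale with width or gradient statistics is part of the open question.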
Open Question 4.1: Deep linear net dynamics in real networks. To what extent do deep linear network dynamics (e.g. greedy low-rank progression) carry over to nonlinear networks trained in practice?