Open Question 3.9

Why tokens $\propto$ parameters in LLMs?


Why is the compute-optimal prescription for LLMs a fixed number of tokens per parameter? A good place to start may be a study of random feature regression, for which the eigenframework of e.g. [Simon et al. (2024)] correctly predicts that the number of parameters and the number of samples should scale proportionally for compute-optimal performance. Can a more general argument be extracted from this simple model? The correctness of a proposed explanation should be confirmed by making a new prediction that can be tested with transformers, such as how changing the task difficulty affects the optimal tokens-to-parameters ratio.
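As a starting point for discussion, here is a minimal numerical sketch (not from the question itself) of the random-feature-regression setup mentioned above: ridge regression on $P$ random tanh features with $N$ training samples, swept at a fixed "compute" budget taken to be $C = N \cdot P$, by loose analogy with the $\approx 6 \times (\text{parameters}) \times (\text{tokens})$ FLOP rule for transformers. The teacher (a single-index tanh target on Gaussian inputs), the feature map, and the ridge value are all illustrative assumptions; the script simply reports where the empirical optimum falls for each budget, which can be compared against the $N \propto P$ prediction.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 32          # input dimension (illustrative choice)
n_test = 2000   # held-out evaluation points
ridge = 1e-6    # small ridge to keep the least-squares solve well posed

# Fixed "teacher": a random single-index target y = tanh(w . x).
w_teacher = rng.standard_normal(d) / np.sqrt(d)

def sample_data(n):
    X = rng.standard_normal((n, d)) / np.sqrt(d)
    y = np.tanh(X @ w_teacher)
    return X, y

def random_feature_fit(P, N):
    """Test MSE of ridge regression on P random tanh features, N train samples."""
    W = rng.standard_normal((d, P)) / np.sqrt(d)   # frozen random first layer
    Xtr, ytr = sample_data(N)
    Xte, yte = sample_data(n_test)
    Ftr, Fte = np.tanh(Xtr @ W), np.tanh(Xte @ W)
    # Solve (F^T F + ridge * I) a = F^T y for the readout weights a.
    A = Ftr.T @ Ftr + ridge * np.eye(P)
    a = np.linalg.solve(A, Ftr.T @ ytr)
    return np.mean((Fte @ a - yte) ** 2)

# For each compute budget C = N * P, sweep the split between features and samples.
for C in [4_000, 16_000, 64_000]:
    results = []
    for P in np.unique(np.logspace(1, np.log10(C / 10), 12).astype(int)):
        N = max(C // P, 1)
        err = np.mean([random_feature_fit(P, N) for _ in range(3)])  # average over seeds
        results.append((P, N, err))
    P_best, N_best, err_best = min(results, key=lambda r: r[2])
    print(f"C={C:>6}: best P={P_best:>5}, N={N_best:>5}, "
          f"N/P={N_best / P_best:.2f}, test MSE={err_best:.4f}")
```

If the proportional-scaling prediction holds in this toy setting, the printed ratio N/P at the optimum should stay roughly constant as the budget C grows; varying the teacher (e.g. its effective difficulty or spectral decay) is one way to probe how the optimal ratio shifts with the task, in the spirit of the test proposed above.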

This is a discussion page for the open question above. Feel free to share ideas, approaches, or relevant research in the comments below.

Discussion