Open Question 3.10

Why depth $\propto$ width in LLMs?
Why, judging by publicly reported LLM architecture specs, is it seemingly optimal to scale transformer depth proportionally to width?
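To make the empirical observation concrete, here is a quick sketch that computes the width-to-depth aspect ratio for a few publicly reported transformer specs (layer counts and hidden sizes as published; the selection is illustrative, not exhaustive):

```python
# Publicly reported transformer specs: model -> (n_layers, d_model).
# Values taken from the GPT-3 and LLaMA papers; illustrative only.
specs = {
    "GPT-3 175B": (96, 12288),
    "LLaMA 65B": (80, 8192),
    "LLaMA 7B": (32, 4096),
}

# The aspect ratio d_model / n_layers stays in a narrow band (~100-130)
# across a ~25x range of parameter counts, i.e. width grows roughly in
# proportion to depth.
for name, (depth, width) in specs.items():
    print(f"{name}: depth={depth}, width={width}, width/depth={width / depth:.1f}")
```

The rough constancy of this ratio across scales is exactly the pattern the question asks about: reported architectures scale depth and width together rather than, say, holding depth fixed and widening.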

This is a discussion page for the open question above. Feel free to share ideas, approaches, or relevant research in the comments below.

Discussion