Open Question 3.10
Why depth $\propto$ width in LLMs?
Why, judging by publicly reported LLM architecture specs, is it seemingly optimal to scale transformer depth proportional to width?
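As a concrete illustration of the observation, here is a small sketch computing the width-to-depth ratio $d_{\text{model}} / n_{\text{layers}}$ for a few publicly reported configurations (specs as stated in the GPT-3 and LLaMA papers; the dictionary below is just an illustrative sample, not an exhaustive survey):

```python
# Width-to-depth ratios for a few publicly reported transformer configs.
# Specs as reported in the GPT-3 (Brown et al., 2020) and LLaMA
# (Touvron et al., 2023) papers; illustrative sample only.
configs = {
    "GPT-3 175B": {"d_model": 12288, "n_layers": 96},
    "LLaMA 7B":   {"d_model": 4096,  "n_layers": 32},
    "LLaMA 13B":  {"d_model": 5120,  "n_layers": 40},
    "LLaMA 65B":  {"d_model": 8192,  "n_layers": 80},
}

for name, c in configs.items():
    ratio = c["d_model"] / c["n_layers"]
    print(f"{name}: d_model / n_layers = {ratio:.1f}")
```

The ratio stays roughly constant (around 100–130) across two orders of magnitude in parameter count, which is the pattern the question asks about: as models grow, depth is scaled up roughly in proportion to width rather than held fixed or scaled much faster.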
This is the discussion page for the open question above. Feel free to share ideas, approaches, or relevant research in the comments below.
Discussion