"Deeper is better" counterexample.

Open Question 3.7: "Deeper is better" counterexample. Is there a nontrivial example of a task for which a deeper network does not perform better, even when all other hyperparameters are optimally tuned?

See question in context | See all open questions

This is a discussion page for the open question above. Feel free to share ideas, approaches, or relevant research in the comments below.

Discussion