Open Question 3.13
Why gated activation functions?
Modern transformer architectures often use gated activation functions like SwiGLU in place of ordinary pointwise nonlinearities like ReLU. SwiGLU is particularly puzzling compared to the original GLU activation function because it diverges quadratically as a layer input \(\mathbf{h}_\ell\) grows in norm: writing \(\mathrm{SwiGLU}(\mathbf{h}) = \mathrm{Swish}(W_1 \mathbf{h}) \odot (W_2 \mathbf{h})\), both factors grow linearly in \(\|\mathbf{h}\|\) (Swish is asymptotically linear, unlike GLU's saturating sigmoid gate), so their elementwise product grows quadratically. ReLU and most of its common variants, by contrast, diverge only linearly. As [Shazeer (2020)] says after proposing SwiGLU:
We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence.
So: whence the advantage of gated activation functions in large transformer models?
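To make the quadratic-vs-linear contrast concrete, here is a minimal numerical sketch in NumPy. The weight matrices, dimension, and bias-free formulation are illustrative assumptions, not taken from any particular model: it just scales an input vector and watches how the output norms of SwiGLU and ReLU layers grow.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
# Hypothetical projection matrices for illustration only.
W_gate = rng.standard_normal((d, d)) / np.sqrt(d)
W_up = rng.standard_normal((d, d)) / np.sqrt(d)

def swish(x, beta=1.0):
    # Swish (SiLU when beta=1): x * sigmoid(beta * x),
    # written via tanh for numerical stability at large |x|.
    return 0.5 * x * (1.0 + np.tanh(0.5 * beta * x))

def swiglu(h):
    # SwiGLU: Swish-activated projection gating a linear projection.
    # Both factors grow linearly in ||h||, so the product is quadratic.
    return swish(W_gate @ h) * (W_up @ h)

def relu(h):
    # Plain ReLU layer for comparison: positively homogeneous,
    # so scaling h by c > 0 scales the output by exactly c.
    return np.maximum(W_gate @ h, 0.0)

h = rng.standard_normal(d)
for scale in (1.0, 10.0, 100.0):
    print(f"scale {scale:>5}: "
          f"||SwiGLU|| = {np.linalg.norm(swiglu(scale * h)):10.2f}, "
          f"||ReLU|| = {np.linalg.norm(relu(scale * h)):10.2f}")
```

Multiplying the input by 10 multiplies the ReLU output norm by exactly 10, while the SwiGLU output norm grows by roughly 100 once the Swish gate is out of its nonlinear region.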
This is a discussion page for the open question above. Feel free to share ideas, approaches, or relevant research in the comments below.
Discussion