Linear Attention

This blog records my understanding of linear attention. The linear attention module in hybrid-attention models seems to have converged to Gated DeltaNet (GDN), but I will cover more than just GDN.

Vanilla Linear Attention

The softmax causal attention can be formulated as:

$$\mathbf{O} = \operatorname{softmax}\left(\mathbf{Q}\mathbf{K}^\top \odot \mathbf{M}\right)\mathbf{V},$$

where $\mathbf{M} \in \{-\infty, 1\}^{T \times T}$ is the causal attention mask ($\mathbf{M}_{ij} = 1$ if $i \ge j$ and $-\infty$ otherwise; the $1/\sqrt{d}$ scaling is omitted for brevity).
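
As a concrete reference, here is a minimal PyTorch sketch of the masked softmax attention above (single head, no batch dimension; the function name and shapes are my own, and unlike the equation it keeps the $1/\sqrt{d}$ scaling):

```python
import torch
import torch.nn.functional as F

def softmax_causal_attention(q, k, v):
    """Quadratic-time causal softmax attention. q, k: (T, d_k); v: (T, d_v)."""
    T, d_k = q.shape
    scores = (q @ k.T) / d_k ** 0.5                       # (T, T) attention logits
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
    scores = scores.masked_fill(~causal, float("-inf"))   # hide future tokens
    return F.softmax(scores, dim=-1) @ v                  # (T, d_v)
```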

Linear attention is just dot-product attention without the softmax:

$$\mathbf{o}_t = \sum_{i=1}^{t} \left(\mathbf{q}_t^\top \mathbf{k}_i\right) \mathbf{v}_i = \sum_{i=1}^{t} \mathbf{v}_i \left(\mathbf{k}_i^\top \mathbf{q}_t\right) = \left(\sum_{i=1}^{t} \mathbf{v}_i \mathbf{k}_i^\top\right) \mathbf{q}_t.$$

The first equality holds because $\mathbf{q}_t^\top \mathbf{k}_i$ is a scalar, so it can be moved freely. The second equality follows from associativity.
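
In matrix form this reads $\mathbf{O} = \left(\mathbf{Q}\mathbf{K}^\top \odot \mathbf{M}\right)\mathbf{V}$, where the causal mask $\mathbf{M}$ is now a binary lower-triangular matrix, since there is no softmax to absorb $-\infty$. A sketch of this parallel (quadratic-time) form, under the same assumed shapes as above:

```python
import torch

def linear_attention_parallel(q, k, v):
    """Parallel O(T^2) form of causal linear attention: O = (Q K^T ⊙ M) V."""
    T = q.shape[0]
    scores = q @ k.T                                   # (T, T), no softmax
    causal = torch.tril(torch.ones(T, T, dtype=q.dtype))
    return (scores * causal) @ v                       # (T, d_v)
```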

We define $\mathbf{S}_t = \sum_{i=1}^{t} \mathbf{v}_i \mathbf{k}_i^\top$. Therefore $\mathbf{S}_t = \mathbf{S}_{t-1} + \mathbf{v}_t \mathbf{k}_t^\top$ with $\mathbf{S}_0 = \mathbf{0}$. In this way, dot-product attention without softmax can be interpreted as a recurrence: the output at timestep $t$, $\mathbf{o}_t$, is the query (input) $\mathbf{q}_t$ reading information from the hidden state $\mathbf{S}_t$:

$$\mathbf{S}_t = \mathbf{S}_{t-1} + \mathbf{v}_t \mathbf{k}_t^\top, \qquad \mathbf{o}_t = \mathbf{S}_t \mathbf{q}_t.$$

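This is exactly an RNN with a matrix-valued hidden state, and it yields the linear-time form. A minimal sketch, with a sanity check that the recurrence matches the parallel form (shapes and names are again my own):

```python
import torch

def linear_attention_recurrent(q, k, v):
    """Recurrent O(T) form: S_t = S_{t-1} + v_t k_t^T, o_t = S_t q_t."""
    T, d_k = q.shape
    d_v = v.shape[1]
    S = torch.zeros(d_v, d_k, dtype=q.dtype)    # matrix-valued hidden state
    outputs = []
    for t in range(T):
        S = S + torch.outer(v[t], k[t])         # rank-1 write to the state
        outputs.append(S @ q[t])                # query reads from the state
    return torch.stack(outputs)                 # (T, d_v)

# Sanity check against the parallel form (Q K^T ⊙ M) V.
q, k = torch.randn(8, 4), torch.randn(8, 4)
v = torch.randn(8, 6)
parallel = (q @ k.T * torch.tril(torch.ones(8, 8))) @ v
assert torch.allclose(linear_attention_recurrent(q, k, v), parallel, atol=1e-5)
```

Note that the recurrent form carries only an $\mathbb{R}^{d_v \times d_k}$ state from step to step, which is why linear attention admits constant-memory, linear-time inference.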
[WIP]