Linear Attention

This post records my understanding of linear attention. The linear attention module used in hybrid-attention models seems to have converged on Gated DeltaNet (GDN), but I will cover more than GDN here.

Vanilla Linear Attention

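Let $Q, K, V \in \mathbb{R}^{L \times d}$ be the query, key, and value matrices for a sequence of length $L$, and let $q_t, k_t, v_t \in \mathbb{R}^d$ denote their $t$-th rows, written as column vectors. Softmax causal attention (omitting the $1/\sqrt{d}$ scaling) can be formulated as:

Training: $O = \mathrm{softmax}(QK^\top + M)\, V$
Inference: $o_t = \sum_{j=1}^{t} \frac{\exp(q_t^\top k_j)}{\sum_{l=1}^{t} \exp(q_t^\top k_l)}\, v_j$

where $M$ is the causal attention mask, with $M_{ij} = 0$ for $j \le i$ and $M_{ij} = -\infty$ otherwise.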
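To make the two forms concrete, here is a minimal NumPy sketch that checks the training (masked matrix) form against the inference (per-timestep) form. It assumes an additive mask (0 on and below the diagonal, $-\infty$ above) and no $1/\sqrt{d}$ scaling; the function names, shapes, and random inputs are illustrative choices of mine, not taken from any library.

```python
import numpy as np

def softmax_attention_parallel(Q, K, V):
    """Training form: softmax(Q K^T + M) V with an additive causal mask M."""
    L = Q.shape[0]
    allowed = np.tril(np.ones((L, L), dtype=bool))   # True where j <= i
    M = np.where(allowed, 0.0, -np.inf)
    scores = Q @ K.T + M
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

def softmax_attention_step(q_t, K_cache, V_cache):
    """Inference form for one timestep: attend over every cached key/value."""
    scores = K_cache @ q_t                   # the scalars q_t^T k_j for j <= t
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()
    return weights @ V_cache

rng = np.random.default_rng(0)
L, d = 6, 4                                  # assumed toy sizes
Q, K, V = (rng.standard_normal((L, d)) for _ in range(3))

O_parallel = softmax_attention_parallel(Q, K, V)
O_stepwise = np.stack(
    [softmax_attention_step(Q[t], K[: t + 1], V[: t + 1]) for t in range(L)]
)
print(np.allclose(O_parallel, O_stepwise))
```

Note that the per-step function needs all keys and values up to $t$, so the cache grows linearly with the sequence length.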

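Linear attention is just dot-product attention without softmax:

Training: $O = \left((QK^\top) \odot M\right) V$
Inference: $o_t = \sum_{j=1}^{t} (q_t^\top k_j)\, v_j = \sum_{j=1}^{t} v_j\, (k_j^\top q_t) = \left(\sum_{j=1}^{t} v_j k_j^\top\right) q_t$

Here $M$ is the binary causal mask ($M_{ij} = 1$ for $j \le i$ and $0$ otherwise), applied elementwise, and $q_t, k_j, v_j$ are column vectors. The first equality holds because $q_t^\top k_j$ is a scalar, so it can be moved freely. The second equality follows from associativity of matrix multiplication, after which $q_t$ can be pulled out of the sum.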
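As a sanity check on this chain of equalities, here is a small NumPy sketch comparing the masked matrix form with the "sum of outer products, then multiply by the query" form. It assumes a binary lower-triangular mask and column-vector conventions; the function names and toy sizes are illustrative assumptions.

```python
import numpy as np

def linear_attention_parallel(Q, K, V):
    """Training form: ((Q K^T) * M) V with a binary lower-triangular mask M."""
    L = Q.shape[0]
    M = np.tril(np.ones((L, L)))             # M[i, j] = 1 if j <= i else 0
    return ((Q @ K.T) * M) @ V

def linear_attention_sum(Q, K, V):
    """Inference form: o_t = (sum_{j<=t} v_j k_j^T) q_t, rebuilt from scratch at each t."""
    L = Q.shape[0]
    out = np.zeros_like(V)
    for t in range(L):
        # np.outer(V[j], K[j]) is the matrix v_j k_j^T
        kv = sum(np.outer(V[j], K[j]) for j in range(t + 1))
        out[t] = kv @ Q[t]
    return out

rng = np.random.default_rng(0)
L, d = 6, 4                                  # assumed toy sizes
Q, K, V = (rng.standard_normal((L, d)) for _ in range(3))

O_parallel = linear_attention_parallel(Q, K, V)
O_sum = linear_attention_sum(Q, K, V)
print(np.allclose(O_parallel, O_sum))
```

The second function is deliberately naive: it recomputes the whole sum at every timestep, which is exactly the redundancy that a running state removes.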

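We define $S_t = \sum_{j=1}^{t} v_j k_j^\top$. Therefore $o_t = S_t q_t$. In this way, dot-product attention without softmax can be interpreted as: the output at timestep $t$, $o_t$, is the query (input) $q_t$ reading information from the hidden state $S_t$:

$S_t \in \mathbb{R}^{d \times d}$, so the memory capacity is a $d \times d$ matrix whose rank is at most $d$.
The update rule of the hidden state is $S_t = S_{t-1} + v_t k_t^\top$, with $S_0 = 0$.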
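The recurrent view can be sketched in a few lines of NumPy: keep one fixed-size state matrix, add a rank-1 outer product per step, and read it with the query. The sketch also tracks the rank of the state, which after $t$ steps is at most $\min(t, d)$ because the state is a sum of $t$ rank-1 terms. The function name, the zero initial state, and the toy sizes are assumptions made for illustration.

```python
import numpy as np

def linear_attention_recurrent(Q, K, V):
    """Run S <- S + v_t k_t^T, then o_t = S q_t; return outputs and every state."""
    L, d_k = K.shape
    d_v = V.shape[1]
    S = np.zeros((d_v, d_k))                 # assumed initial state: all zeros
    outputs, states = [], []
    for t in range(L):
        S = S + np.outer(V[t], K[t])         # rank-1 write into the memory
        outputs.append(S @ Q[t])             # the query reads from the memory
        states.append(S)
    return np.stack(outputs), states

rng = np.random.default_rng(0)
L, d = 6, 4                                  # assumed toy sizes
Q, K, V = (rng.standard_normal((L, d)) for _ in range(3))

O_recurrent, states = linear_attention_recurrent(Q, K, V)

# Reference: the masked parallel form of linear attention.
O_parallel = ((Q @ K.T) * np.tril(np.ones((L, L)))) @ V
print(np.allclose(O_recurrent, O_parallel))

# Rank of the state after each step; each entry is at most min(t, d).
ranks = [int(np.linalg.matrix_rank(S)) for S in states]
print(ranks)
```

Unlike the softmax case, each step touches only the fixed-size state, so the per-token cost and memory do not grow with the sequence length.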

[WIP]