# Linear Attention
This blog post records my understanding of linear attention. It seems that the linear attention module in hybrid-attention models has converged to Gated DeltaNet (GDN), but I will cover more than just GDN.
## Vanilla Linear Attention
Softmax causal attention can be formulated as:

- Training: $\mathbf{O} = \mathrm{softmax}\big(\mathbf{Q}\mathbf{K}^\top \odot \mathbf{M}\big)\mathbf{V}$
- Inference: $\mathbf{o}_t = \sum_{i=1}^{t} \frac{\exp(\mathbf{q}_t^\top \mathbf{k}_i)}{\sum_{j=1}^{t} \exp(\mathbf{q}_t^\top \mathbf{k}_j)}\,\mathbf{v}_i$

where $\mathbf{Q}, \mathbf{K}, \mathbf{V} \in \mathbb{R}^{T \times d}$ and $\mathbf{M} \in \{-\infty, 1\}^{T \times T}$ is the causal attention mask (the $1/\sqrt{d}$ scaling is omitted for brevity).
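To make the two views concrete, here is a minimal PyTorch sketch (single head, no $1/\sqrt{d}$ scaling; the tensor names and shapes are my own choices, not from this post) checking that the parallel training form and the per-step inference form of softmax causal attention give the same outputs:

```python
import torch

# Toy sizes and random inputs (assumed for illustration).
T, d = 6, 4
torch.manual_seed(0)
Q, K, V = torch.randn(T, d), torch.randn(T, d), torch.randn(T, d)

# Training (parallel) form: O = softmax(QK^T with future positions masked to -inf) V.
scores = Q @ K.T
future = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
O_parallel = torch.softmax(scores.masked_fill(future, float("-inf")), dim=-1) @ V

# Inference (per-step) form: o_t = sum_{i<=t} softmax_i(q_t^T k_i) v_i.
O_step = torch.stack([
    torch.softmax(K[: t + 1] @ Q[t], dim=0) @ V[: t + 1]
    for t in range(T)
])

print(torch.allclose(O_parallel, O_step, atol=1e-6))  # True
```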
Linear attention is just dot-product attention without the softmax:

- Training: $\mathbf{O} = \big(\mathbf{Q}\mathbf{K}^\top \odot \mathbf{M}\big)\mathbf{V}$, where $\mathbf{M} \in \{0, 1\}^{T \times T}$ is now the binary lower-triangular causal mask
- Inference: $\mathbf{o}_t = \sum_{i=1}^{t} (\mathbf{q}_t^\top \mathbf{k}_i)\,\mathbf{v}_i = \sum_{i=1}^{t} \mathbf{v}_i\,(\mathbf{k}_i^\top \mathbf{q}_t) = \Big(\sum_{i=1}^{t} \mathbf{v}_i \mathbf{k}_i^\top\Big)\mathbf{q}_t$

The first equality holds because $\mathbf{q}_t^\top \mathbf{k}_i$ is a scalar, so it can be moved freely. The second equality follows from the associativity of matrix multiplication.
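The same check for linear attention, again as a minimal sketch with assumed shapes and names: the masked parallel form $(\mathbf{Q}\mathbf{K}^\top \odot \mathbf{M})\mathbf{V}$ matches the per-step sum $\sum_{i=1}^{t} (\mathbf{q}_t^\top \mathbf{k}_i)\,\mathbf{v}_i$:

```python
import torch

T, d = 6, 4
torch.manual_seed(0)
Q, K, V = torch.randn(T, d), torch.randn(T, d), torch.randn(T, d)

# Training (parallel) form: O = (QK^T ⊙ M) V with M the lower-triangular causal mask.
M = torch.tril(torch.ones(T, T))
O_parallel = ((Q @ K.T) * M) @ V

# Inference (per-step) form: o_t = Σ_{i<=t} (q_t^T k_i) v_i.
O_step = torch.stack([(K[: t + 1] @ Q[t]) @ V[: t + 1] for t in range(T)])

print(torch.allclose(O_parallel, O_step, atol=1e-5))  # True
```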
We define $\mathbf{S}_t = \sum_{i=1}^{t} \mathbf{v}_i \mathbf{k}_i^\top$. Therefore $\mathbf{o}_t = \mathbf{S}_t \mathbf{q}_t$. In this way, dot-product attention without softmax can be interpreted as: the output at timestep $t$, $\mathbf{o}_t$, is the query (input) $\mathbf{q}_t$ reading information from the hidden state $\mathbf{S}_t$:

- $\mathbf{S}_t \in \mathbb{R}^{d \times d}$, so the memory capacity is a matrix whose rank is at most $d$.
- The update rule of the hidden state is $\mathbf{S}_t = \mathbf{S}_{t-1} + \mathbf{v}_t \mathbf{k}_t^\top$, with $\mathbf{S}_0 = \mathbf{0}$.
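Here is a minimal sketch of this recurrent view (variable names are mine): the state $\mathbf{S}$ is a single $d \times d$ matrix updated in place, so inference uses constant memory regardless of sequence length, and the result matches the parallel masked form above:

```python
import torch

T, d = 6, 4
torch.manual_seed(0)
Q, K, V = torch.randn(T, d), torch.randn(T, d), torch.randn(T, d)

S = torch.zeros(d, d)                  # hidden state S_0 = 0
outputs = []
for t in range(T):
    S = S + torch.outer(V[t], K[t])    # update: S_t = S_{t-1} + v_t k_t^T
    outputs.append(S @ Q[t])           # readout: o_t = S_t q_t
O_recurrent = torch.stack(outputs)

# Matches the parallel masked form (QK^T ⊙ M) V.
M = torch.tril(torch.ones(T, T))
O_parallel = ((Q @ K.T) * M) @ V
print(torch.allclose(O_recurrent, O_parallel, atol=1e-5))  # True
```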
[WIP]