Linformer: Self-Attention with Linear Complexity
Abstract
The standard self-attention mechanism of the Transformer uses $O(n^2)$ time and space with respect to the sequence length. This paper demonstrates that the self-attention mechanism can be approximated by a low-rank matrix, and proposes a new self-attention mechanism that introduces two projection matrices, reducing the overall complexity to $O(n)$ while keeping performance close to standard self-attention.
Introduction
- Main problem: the $O(n^2)$ complexity of self-attention in the standard Transformer.
- Goal: to find a new attention operation that avoids the quadratic cost.
- Prior work:
Background
Let $P$ denote the attention matrix, which (per head) is computed as
$$P = \mathrm{softmax}\!\left(\frac{QW_i^Q\,(KW_i^K)^T}{\sqrt{d_k}}\right).$$
The Transformer uses P to capture the input context for a given token; in other words, P is the attention weight matrix. However, computing P costs quadratic time and space in the sequence length.
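As a concrete illustration, here is a minimal sketch (PyTorch, with made-up example shapes, not the paper's code) of computing P for a single head; the $(n, n)$ shape of P is exactly what makes the cost quadratic:

```python
# Minimal sketch of the standard attention matrix P for one head.
import torch

n, d_k = 512, 64                       # example sequence length and per-head dimension
Q = torch.randn(n, d_k)                # queries  Q W^Q
K = torch.randn(n, d_k)                # keys     K W^K
V = torch.randn(n, d_k)                # values   V W^V

P = torch.softmax(Q @ K.transpose(-2, -1) / d_k ** 0.5, dim=-1)  # shape (n, n)
context = P @ V                        # O(n^2) time, and P itself takes O(n^2) memory
print(P.shape)                         # torch.Size([512, 512])
```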
Self-Attention is Low Rank
This paper demonstrates both theoretically and empirically that self-attention is low-rank, i.e., the attention weight matrix P is low-rank. The experimental results are shown below:
In the left plot above, we can see a clear long-tail spectrum distribution. This implies that most of the information in the matrix P can be recovered from the first few largest singular values. In the right plot above, this phenomenon is even more obvious: in higher layers, more information is concentrated in the largest singular values and the rank of P is lower.
Given this property of the attention weight matrix P, we could directly use singular value decomposition (SVD) to approximate P with a low-rank matrix $P_{\mathrm{low}}$, as follows:
$$P \approx P_{\mathrm{low}} = \sum_{i=1}^{k} \sigma_i u_i v_i^T,$$
where $\sigma_i$, $u_i$ and $v_i$ are the singular values and singular vectors of P.
Using the first k singular values and the corresponding singular vectors approximates P within some error $\epsilon$; however, this approach requires an SVD for every attention matrix, which adds extra complexity. Therefore, the paper proposes another approach that uses learned projection matrices to avoid this added cost.
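For intuition, here is a minimal sketch (my own illustration, not the paper's code) of the SVD-based truncation, keeping only the top-k singular values of a toy attention matrix:

```python
# Rank-k truncation of an attention matrix via SVD.
import torch

def low_rank_approx(P: torch.Tensor, k: int) -> torch.Tensor:
    """Keep only the k largest singular values/vectors of P."""
    U, S, Vh = torch.linalg.svd(P)                    # full SVD: P = U diag(S) Vh
    return U[:, :k] @ torch.diag(S[:k]) @ Vh[:k, :]

P = torch.softmax(torch.randn(512, 512), dim=-1)      # toy (n, n) attention matrix
P_low = low_rank_approx(P, k=128)
print(torch.norm(P - P_low) / torch.norm(P))           # relative approximation error
```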
Model
The main idea is to add two linear projection matrices $E_i, F_i$ (of size $n \times k$) when computing the keys and values.
In the bottom right of the picture above, we can see how the dimensions change in the new mechanism. The context embedding for each $\mathrm{head}_i$ is computed as follows:
$$\overline{\mathrm{head}_i} = \mathrm{softmax}\!\left(\frac{QW_i^Q\,(E_i K W_i^K)^T}{\sqrt{d_k}}\right)\cdot F_i V W_i^V.$$
The dimension of the resulting context embedding is still $n \times d$, but the time and space cost is now linear in n. The paper also gives a theoretical analysis to prove that this mechanism is reliable.
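Below is a minimal sketch of a single Linformer-style attention head following the layout described above; the class name, initialization scale, and example shapes are my own choices, and E/F are stored as $(k, n)$ so they can left-multiply the keys and values:

```python
# Sketch of one Linformer attention head with learned key/value projections E and F.
import torch
import torch.nn as nn

class LinformerHead(nn.Module):
    def __init__(self, d_model: int, d_head: int, n: int, k: int):
        super().__init__()
        self.W_q = nn.Linear(d_model, d_head, bias=False)
        self.W_k = nn.Linear(d_model, d_head, bias=False)
        self.W_v = nn.Linear(d_model, d_head, bias=False)
        self.E = nn.Parameter(torch.randn(k, n) / n ** 0.5)   # projects keys   to length k
        self.F = nn.Parameter(torch.randn(k, n) / n ** 0.5)   # projects values to length k
        self.scale = d_head ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:       # x: (n, d_model)
        Q, K, V = self.W_q(x), self.W_k(x), self.W_v(x)        # each (n, d_head)
        K_proj = self.E @ K                                    # (k, d_head)
        V_proj = self.F @ V                                    # (k, d_head)
        P_bar = torch.softmax(Q @ K_proj.transpose(-2, -1) * self.scale, dim=-1)  # (n, k)
        return P_bar @ V_proj                                  # (n, d_head)

head = LinformerHead(d_model=512, d_head=64, n=512, k=128)
out = head(torch.randn(512, 512))
print(out.shape)   # torch.Size([512, 64])
```

Note that the attention matrix computed here is only $(n, k)$, so for a fixed k both time and memory scale linearly with n.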
This paper suggests that several additional techniques can be introduced on top of Linformer to further optimize for both performance and efficiency (see the sketch after this list):
- Parameter sharing between projections: headwise sharing, key-value sharing, layerwise sharing.
- Nonuniform projected dimension: we can choose a smaller projected dimension k for higher layers.
- General projections: one can also choose different kinds of low-dimensional projection methods (e.g., pooling or convolution) instead of a simple linear projection.
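As a rough sketch (assumed, not from the paper's code), the sharing options only change how many distinct projection tensors are created:

```python
# Illustration of the three parameter-sharing options for the (k, n) projections.
import torch
import torch.nn as nn

n, k, n_layers = 512, 128, 12

# Headwise sharing: one E per layer, shared by every head in that layer (F kept separate).
E_per_layer = [nn.Parameter(torch.randn(k, n) / n ** 0.5) for _ in range(n_layers)]

# Key-value sharing: additionally reuse the same matrix for keys and values (F_i = E_i).
F_per_layer = E_per_layer

# Layerwise sharing: a single projection E shared across all heads, keys, values and layers.
E_global = nn.Parameter(torch.randn(k, n) / n ** 0.5)
```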
Experiments
- Pretraining
- Compared with RoBERTa, which is based on the Transformer.
- Dataset: BookCorpus plus English Wikipedia
- Task: masked language modeling
- Measure: perplexity
- Device: 64 Tesla V100 GPUs with 250k updates
The projected dimension k is effective: performance is already close to that of the original Transformer with k = 128 for n = 512 and k = 256 for n = 1024 ((a) and (b)).
We can decrease the number of additional parameters in the model because the performance of the shared models almost matches that of the non-shared model. (c)
As the sequence length increases, even though the projected dimension is fixed, the final perplexities after convergence remain about the same. (d)
- Downstream Results
The performance of Linformer is close to that of the original Transformer, and it is mainly determined by the projected dimension k rather than the ratio n/k.
- Inference-time Efficiency Results
As the sequence length increases, the inference-time speed-up and memory savings become even more dramatic.