6 changes: 4 additions & 2 deletions chapters/en/chapter1/6.mdx
@@ -163,10 +163,12 @@ use a sparse version of the attention matrix to speed up training.
> [!TIP]
> Standard attention mechanisms have a computational complexity of O(n²), where n is the sequence length. This becomes problematic for very long sequences. The specialized attention mechanisms below help address this limitation.

### LSH attention
### LSH (Locality-Sensitive Hashing) attention

[Reformer](https://huggingface.co/docs/transformers/model_doc/reformer) uses LSH attention. In the softmax(QK^t), only the biggest elements (in the softmax dimension) of the matrix QK^t are going to give useful contributions. So for each query q in Q, we can consider only
the keys k in K that are close to q. A hash function is used to determine if q and k are close. The attention mask is
the keys k in K that are close to q. To do this, queries and keys are hashed into buckets using LSH, and attention is computed within each bucket (rather than selecting neighbors individually for each query).
```hash vectors -> group into buckets -> attend inside bucket```
A hash function is used to determine if q and k are close. The attention mask is
modified to mask the current token (except at the first position), because its query and key are identical (and
therefore very similar to each other). Since the hash can be a bit random, several hash functions are used in practice
(determined by an n_rounds parameter) and their results are averaged together.
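
The hash, bucket, attend steps described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Reformer's actual implementation: the names `lsh_bucket` and `lsh_attention` are hypothetical, the hash is a random-projection (angular) scheme, queries and keys share one vector per token, attention is non-causal, and a token attends to itself only when it is alone in its bucket (the analogue of the first-position exception). The rounds are combined by a plain average, following the description above.

```python
import math
import random


def lsh_bucket(vec, projections):
    """Angular LSH: project the vector, then return the argmax over [p, -p]."""
    proj = [sum(v * r for v, r in zip(vec, row)) for row in projections]
    scores = proj + [-p for p in proj]
    return max(range(len(scores)), key=scores.__getitem__)


def lsh_attention(qk, values, n_buckets=4, n_rounds=2, seed=0):
    """Bucketed attention with shared query/key vectors (an assumption here)."""
    rng = random.Random(seed)
    n, d, d_v = len(qk), len(qk[0]), len(values[0])
    out = [[0.0] * d_v for _ in range(n)]
    for _ in range(n_rounds):
        # One fresh random projection per round; n_buckets // 2 rows
        # give n_buckets possible hash values.
        projections = [[rng.gauss(0.0, 1.0) for _ in range(d)]
                       for _ in range(n_buckets // 2)]
        # Group token indices by their hash value.
        buckets = {}
        for i, vec in enumerate(qk):
            buckets.setdefault(lsh_bucket(vec, projections), []).append(i)
        # Softmax attention restricted to each bucket.
        for members in buckets.values():
            for i in members:
                # Mask the token itself unless it is alone in its bucket.
                keys = [j for j in members if j != i] or [i]
                logits = [sum(a * b for a, b in zip(qk[i], qk[j])) / math.sqrt(d)
                          for j in keys]
                top = max(logits)  # subtract the max for numerical stability
                weights = [math.exp(x - top) for x in logits]
                total = sum(weights)
                for j, w in zip(keys, weights):
                    for c in range(d_v):
                        # Dividing by n_rounds averages the rounds together.
                        out[i][c] += (w / total) * values[j][c] / n_rounds
    return out


# Example: 16 tokens with 8-dimensional shared query/key vectors.
rng = random.Random(1)
qk = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(16)]
values = [[float(i), 1.0] for i in range(16)]
out = lsh_attention(qk, values)
```

Each token only scores the keys in its own bucket, so the cost per round depends on bucket sizes rather than on the full n × n matrix. Using several rounds reduces the chance that two similar vectors never share a bucket.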