diff --git a/chapters/en/chapter1/6.mdx b/chapters/en/chapter1/6.mdx
index 049078cf3..6ba12181e 100644
--- a/chapters/en/chapter1/6.mdx
+++ b/chapters/en/chapter1/6.mdx
@@ -163,10 +163,12 @@ use a sparse version of the attention matrix to speed up training.
 > [!TIP]
 > Standard attention mechanisms have a computational complexity of O(n²), where n is the sequence length. This becomes problematic for very long sequences. The specialized attention mechanisms below help address this limitation.

-### LSH attention
+### LSH (Locality Sensitive Hashing) attention

 [Reformer](https://huggingface.co/docs/transformers/model_doc/reformer) uses LSH attention. In the softmax(QK^t), only the biggest elements (in the softmax dimension) of the matrix QK^t are going to give useful contributions. So for each query q in Q, we can consider only
-the keys k in K that are close to q. A hash function is used to determine if q and k are close. The attention mask is
+the keys k in K that are close to q. To do this, queries and keys are hashed into buckets with LSH, and attention is computed within each bucket rather than by selecting neighbors for each query individually:
+```hash vectors -> group into buckets -> attend inside bucket```
+A hash function is used to determine whether q and k are close. The attention mask is
 modified to mask the current token (except at the first position), because it will give a query and a key equal (so very similar to each other). Since the hash can be a bit random, several hash functions are used in practice (determined by a n_rounds parameter) and then are averaged together.
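
The bucketed-attention idea the diff describes can be sketched in a few lines of NumPy. This is a simplified, single-round toy (the function and helper names are made up for illustration, and real Reformer uses shared query/key projections, chunking, and multiple hash rounds, none of which is modeled here): vectors are hashed with random projections, and full softmax attention runs only inside each bucket.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def lsh_bucket_attention(q, k, v, n_buckets=4, seed=0):
    """Toy LSH attention (hypothetical helper, not Reformer's API):
    hash q and k into buckets, then attend only inside each bucket."""
    rng = np.random.default_rng(seed)
    d = q.shape[-1]
    # Random-projection hash, shared by queries and keys so that
    # nearby vectors tend to land in the same bucket.
    proj = rng.standard_normal((d, n_buckets // 2))

    def bucket(x):
        h = x @ proj
        # Reformer-style trick: argmax over [h, -h] gives n_buckets ids.
        return np.argmax(np.concatenate([h, -h], axis=-1), axis=-1)

    q_buckets, k_buckets = bucket(q), bucket(k)
    out = np.zeros_like(v)
    for b in range(n_buckets):
        qi = np.where(q_buckets == b)[0]
        ki = np.where(k_buckets == b)[0]
        if len(qi) == 0 or len(ki) == 0:
            continue  # a query with no keys in its bucket keeps a zero output
        scores = q[qi] @ k[ki].T / np.sqrt(d)
        out[qi] = softmax(scores, axis=-1) @ v[ki]
    return out
```

With a single random hash, similar vectors can occasionally be split across buckets; that is exactly why the diff's context mentions averaging over several hash rounds (the `n_rounds` parameter) in practice.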