huggingface · syedmuhammadayyanibrar · Mar 5, 2026
diff --git a/chapters/en/chapter1/6.mdx b/chapters/en/chapter1/6.mdx
@@ -163,10 +163,12 @@ use a sparse version of the attention matrix to speed up training.
 > [!TIP]
 > Standard attention mechanisms have a computational complexity of O(n²), where n is the sequence length. This becomes problematic for very long sequences. The specialized attention mechanisms below help address this limitation.
 
-### LSH attention
+### LSH (Locality-Sensitive Hashing) attention
 
 [Reformer](https://huggingface.co/docs/transformers/model_doc/reformer) uses LSH attention. In the softmax(QK^t), only the biggest elements (in the softmax dimension) of the matrix QK^t are going to give useful contributions. So for each query q in Q, we can consider only
-the keys k in K that are close to q. A hash function is used to determine if q and k are close. The attention mask is
+the keys k in K that are close to q. To find them, queries and keys are hashed into buckets using LSH, so that vectors that are close to each other are likely to share a bucket. Attention is then computed within each bucket instead of selecting neighbors individually for each query:
+`hash vectors -> group into buckets -> attend inside bucket`.
+The attention mask is
 modified to mask the current token (except at the first position), because it will give a query and a key equal (so
 very similar to each other). Since the hash can be a bit random, several hash functions are used in practice
 (determined by a n_rounds parameter) and then are averaged together.
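
Reviewer note: the bucket pipeline described in this hunk can be illustrated with a small NumPy sketch. It is a toy example and not the Reformer implementation. The function names `lsh_buckets` and `lsh_attention` and the `n_buckets` parameter are hypothetical. The sketch assumes queries and keys share one matrix `qk` (which is why a token's query equals its own key), uses random projections as the hash, is non-causal, and combines the rounds with a plain average as the text describes. The source's "except at the first position" exception is approximated here by leaving a token unmasked when it is alone in its bucket.

```python
import numpy as np

def lsh_buckets(x, n_buckets, rng):
    """Assign each row of x to a bucket using random projections (assumed hash)."""
    d = x.shape[-1]
    # One random direction per pair of buckets; n_buckets is assumed to be even.
    r = rng.standard_normal((d, n_buckets // 2))
    proj = x @ r
    # The argmax over [proj, -proj] selects one of n_buckets directions,
    # so vectors pointing the same way tend to land in the same bucket.
    return np.argmax(np.concatenate([proj, -proj], axis=-1), axis=-1)

def lsh_attention(qk, v, n_buckets=4, n_rounds=2, seed=0):
    """Toy single-head LSH attention with shared queries and keys."""
    rng = np.random.default_rng(seed)
    n, d = qk.shape
    out = np.zeros_like(v, dtype=float)
    for _ in range(n_rounds):
        # A fresh random hash per round, since a single hash can be noisy.
        buckets = lsh_buckets(qk, n_buckets, rng)
        for b in np.unique(buckets):
            idx = np.flatnonzero(buckets == b)
            # Attention is computed only among tokens in the same bucket.
            scores = qk[idx] @ qk[idx].T / np.sqrt(d)
            if len(idx) > 1:
                # Mask each token's own position; a token alone in its
                # bucket keeps itself so the softmax stays defined.
                np.fill_diagonal(scores, -np.inf)
            weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
            weights /= weights.sum(axis=-1, keepdims=True)
            out[idx] += weights @ v[idx]
    # Plain average over hashing rounds (a simplification).
    return out / n_rounds
```

Calling `lsh_attention(qk, v)` with `qk` and `v` of shape `(n, d)` returns an array of shape `(n, d)`. Each bucket costs time quadratic only in the bucket size, which is where the savings over full O(n²) attention come from. The sketch skips the sorting and chunking that a practical implementation would need to batch buckets efficiently.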