diff --git a/chapters/en/chapter1/8.mdx b/chapters/en/chapter1/8.mdx
index 6be1cb515..3401ecc2a 100644
--- a/chapters/en/chapter1/8.mdx
+++ b/chapters/en/chapter1/8.mdx
@@ -63,7 +63,7 @@ The prefill phase is like the preparation stage in cooking - it's where all the
 2. **Embedding Conversion**: Transforming these tokens into numerical representations that capture their meaning
 3. **Initial Processing**: Running these embeddings through the model's neural networks to create a rich understanding of the context
 
-This phase is computationally intensive because it needs to process all input tokens at once. Think of it as reading and understanding an entire paragraph before starting to write a response.
+This phase is computationally intensive because it needs to process all input tokens at once; it also populates the *KV Cache* with the *Keys* and *Values* of every prompt token, so they do not have to be recomputed at each decoding step. Think of it as reading and understanding an entire paragraph before starting to write a response.
 
 You can experiment with different tokenizers in the interactive playground below:
 
@@ -86,6 +86,12 @@ The decode phase involves several key steps that happen for each new token:
 
 This phase is memory-intensive because the model needs to keep track of all previously generated tokens and their relationships.
 
+| Phase | Operation | GPU Utilization | Goal |
+| :--- | :--- | :--- | :--- |
+| **Prefill** | Parallel (all at once) | High (compute-bound) | Build the cache and emit the first token |
+| **Decode** | Sequential (one by one) | Low (memory-bound) | Generate the rest of the sequence |
+
+
 ## Sampling Strategies
 
 Now that we understand how the model generates text, let's explore the various ways we can control this generation process. Just like a writer might choose between being more creative or more precise, we can adjust how the model makes its token selections.
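
The prefill/decode split and KV Cache described in this diff can be sketched with a toy single-head attention in numpy. This is a minimal illustration, not real model code: all names (`Wq`, `Wk`, `Wv`, `step`, `attention`) and the random "tokens" are made up for the example, and the prefill loop here runs token by token for clarity, whereas real implementations process the whole prompt in one parallel pass.

```python
import numpy as np

def attention(q, K, V):
    # Single-query attention over all cached positions (causal by construction:
    # the cache only ever contains past and current tokens).
    scores = K @ q / np.sqrt(q.shape[0])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

rng = np.random.default_rng(0)
d = 8  # toy hidden size
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def step(x, cache):
    # Project the incoming token, append its Key/Value to the cache,
    # then attend from the new Query over everything cached so far.
    cache["K"].append(Wk @ x)
    cache["V"].append(Wv @ x)
    return attention(Wq @ x, np.array(cache["K"]), np.array(cache["V"]))

# Prefill: process every prompt token, filling the KV cache.
prompt = rng.normal(size=(4, d))  # stand-in for 4 embedded prompt tokens
cache = {"K": [], "V": []}
for x in prompt:
    out = step(x, cache)

# Decode: each new token does only one projection + one attention pass,
# reusing the cached Keys/Values instead of reprocessing the prompt.
new_token = rng.normal(size=d)
out = step(new_token, cache)
print(len(cache["K"]))  # 5 cached entries: 4 prompt tokens + 1 generated
```

Because the cache grows by one Key/Value pair per generated token, decoding cost per step stays roughly constant in compute but grows in memory, which is the compute-bound vs. memory-bound contrast the table above summarizes.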