From e8747d9907f0f5255aa3c9ed597777c3878b95d9 Mon Sep 17 00:00:00 2001
From: ab490
Date: Sun, 8 Mar 2026 22:48:54 -0400
Subject: [PATCH] Fix missing comma in GRPO equation

---
 chapters/en/chapter12/3b.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/chapters/en/chapter12/3b.mdx b/chapters/en/chapter12/3b.mdx
index a849c0b9d..e4d16a275 100644
--- a/chapters/en/chapter12/3b.mdx
+++ b/chapters/en/chapter12/3b.mdx
@@ -84,7 +84,7 @@ The final step is to use these advantage values to update our model so that it b
 
 The target function for policy update is:
 
-$$J_{GRPO}(\theta) = \left[\frac{1}{G} \sum_{i=1}^{G} \min \left( \frac{\pi_{\theta}(o_i|q)}{\pi_{\theta_{old}}(o_i|q)} A_i \text{clip}\left( \frac{\pi_{\theta}(o_i|q)}{\pi_{\theta_{old}}(o_i|q)}, 1 - \epsilon, 1 + \epsilon \right) A_i \right)\right]- \beta D_{KL}(\pi_{\theta} \|\| \pi_{ref})$$
+$$J_{GRPO}(\theta) = \left[\frac{1}{G} \sum_{i=1}^{G} \min \left( \frac{\pi_{\theta}(o_i|q)}{\pi_{\theta_{old}}(o_i|q)} A_i, \text{clip}\left( \frac{\pi_{\theta}(o_i|q)}{\pi_{\theta_{old}}(o_i|q)}, 1 - \epsilon, 1 + \epsilon \right) A_i \right)\right]- \beta D_{KL}(\pi_{\theta} \|\| \pi_{ref})$$
 
 This formula might look intimidating at first, but it's built from several components that each serve an important purpose. Let's break them down one by one.
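
To see why the comma matters, here is a minimal numeric sketch of the corrected objective, in which `min` takes two separate arguments: the unclipped term and the clipped term. The function name `grpo_objective`, its parameters, and the idea of passing a precomputed KL estimate are assumptions made for illustration; this is not taken from the chapter or from any library.

```python
import math


def grpo_objective(logp_new, logp_old, advantages, eps=0.2, beta=0.0, kl=0.0):
    """Group-averaged clipped surrogate minus a KL penalty (illustrative only).

    logp_new / logp_old: log pi_theta(o_i|q) and log pi_theta_old(o_i|q),
        one value per sampled output o_i in the group.
    advantages: the advantage A_i for each output.
    kl: an estimate of D_KL(pi_theta || pi_ref), assumed to be computed elsewhere.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)            # pi_theta / pi_theta_old
        clipped = min(max(ratio, 1 - eps), 1 + eps)  # clip(ratio, 1 - eps, 1 + eps)
        # Two arguments to min, separated by the comma this patch restores.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages) - beta * kl


# Two outputs: probability ratios 1.5 and 0.5, advantages +1 and -1.
value = grpo_objective(
    logp_new=[math.log(1.5), math.log(0.5)],
    logp_old=[0.0, 0.0],
    advantages=[1.0, -1.0],
)
```

With `eps=0.2`, the first term is min(1.5, 1.2) = 1.2 and the second is min(-0.5, -0.8) = -0.8, so the group average is 0.2 before any KL penalty is subtracted.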