diff --git a/chapters/en/chapter1/5.mdx b/chapters/en/chapter1/5.mdx index f00f6643b..05f2d5713 100644 --- a/chapters/en/chapter1/5.mdx +++ b/chapters/en/chapter1/5.mdx @@ -87,7 +87,7 @@ Text classification involves assigning predefined categories to text documents, 2. BERT is pretrained with two objectives: masked language modeling and next-sentence prediction. In masked language modeling, some percentage of the input tokens are randomly masked, and the model needs to predict these. This solves the issue of bidirectionality, where the model could cheat and see all the words and "predict" the next word. The final hidden states of the predicted mask tokens are passed to a feedforward network with a softmax over the vocabulary to predict the masked word. - The second pretraining object is next-sentence prediction. The model must predict whether sentence B follows sentence A. Half of the time sentence B is the next sentence, and the other half of the time, sentence B is a random sentence. The prediction, whether it is the next sentence or not, is passed to a feedforward network with a softmax over the two classes (`IsNext` and `NotNext`). + The second pretraining objective is next-sentence prediction. The model must predict whether sentence B follows sentence A. Half of the time sentence B is the next sentence, and the other half of the time, sentence B is a random sentence. The prediction, whether it is the next sentence or not, is passed to a feedforward network with a softmax over the two classes (`IsNext` and `NotNext`). 3. The input embeddings are passed through multiple encoder layers to output some final hidden states.