"git@developer.sourcefind.cn:chenpangpang/transformers.git" did not exist on "fe140464213b60dbc5dafa772fa13f0809e3d8b4"
Unverified commit 79877106 authored by Bertrand Thia, committed by GitHub

[RoBERTa] Minor clarifications to model doc (#31949)



* minor edits and clarifications

* address comment
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

---------
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
parent 12b6880c
@@ -51,19 +51,19 @@ This model was contributed by [julien-c](https://huggingface.co/julien-c). The o
 ## Usage tips
-- This implementation is the same as [`BertModel`] with a tiny embeddings tweak as well as a setup
-  for Roberta pretrained models.
+- This implementation is the same as [`BertModel`] with a minor tweak to the embeddings, as well as a setup
+  for RoBERTa pretrained models.
-- RoBERTa has the same architecture as BERT, but uses a byte-level BPE as a tokenizer (same as GPT-2) and uses a
+- RoBERTa has the same architecture as BERT but uses a byte-level BPE as a tokenizer (same as GPT-2) and uses a
   different pretraining scheme.
-- RoBERTa doesn't have `token_type_ids`, you don't need to indicate which token belongs to which segment. Just
-  separate your segments with the separation token `tokenizer.sep_token` (or `</s>`)
+- RoBERTa doesn't have `token_type_ids`, so you don't need to indicate which token belongs to which segment. Just
+  separate your segments with the separation token `tokenizer.sep_token` (or `</s>`).
-- Same as BERT with better pretraining tricks:
-    * dynamic masking: tokens are masked differently at each epoch, whereas BERT does it once and for all
-    * together to reach 512 tokens (so the sentences are in an order than may span several documents)
-    * train with larger batches
-    * use BPE with bytes as a subunit and not characters (because of unicode characters)
+- RoBERTa is similar to BERT but with better pretraining techniques:
+    * Dynamic masking: tokens are masked differently at each epoch, whereas BERT does it once and for all.
+    * Sentence packing: Sentences are packed together to reach 512 tokens (so the sentences are in an order that may span several documents).
+    * Larger batches: Training uses larger batches.
+    * Byte-level BPE vocabulary: Uses BPE with bytes as a subunit instead of characters, accommodating Unicode characters.
-- [CamemBERT](camembert) is a wrapper around RoBERTa. Refer to this page for usage examples.
+- [CamemBERT](camembert) is a wrapper around RoBERTa. Refer to its model page for usage examples.
 ## Resources
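Though not part of the commit itself, the following is a minimal sketch of the tokenizer behavior the updated tips describe, assuming the standard `transformers` `AutoTokenizer` API and the public `FacebookAI/roberta-base` checkpoint:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("FacebookAI/roberta-base")

# Byte-level BPE (same family as GPT-2): arbitrary Unicode text encodes
# without falling back to an unknown token.
encoded = tokenizer("Héllo wörld")
print(list(encoded.keys()))  # ['input_ids', 'attention_mask'] -- no `token_type_ids`

# For segment pairs, the tokenizer inserts `tokenizer.sep_token` (`</s>`) itself,
# so there is no need to build segment indices by hand.
pair = tokenizer("First segment", "Second segment")
print(tokenizer.decode(pair["input_ids"]))
# roughly: <s>First segment</s></s>Second segment</s>
```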