"git@developer.sourcefind.cn:chenpangpang/transformers.git" did not exist on "fe140464213b60dbc5dafa772fa13f0809e3d8b4"
Unverified commit 79877106 authored by Bertrand Thia, committed by GitHub

[RoBERTa] Minor clarifications to model doc (#31949)



* minor edits and clarifications

* address comment
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

---------
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
parent 12b6880c
@@ -51,19 +51,19 @@ This model was contributed by [julien-c](https://huggingface.co/julien-c). The o
 ## Usage tips
-- This implementation is the same as [`BertModel`] with a tiny embeddings tweak as well as a setup
-  for Roberta pretrained models.
+- This implementation is the same as [`BertModel`] with a minor tweak to the embeddings, as well as a setup
+  for RoBERTa pretrained models.
-- RoBERTa has the same architecture as BERT, but uses a byte-level BPE as a tokenizer (same as GPT-2) and uses a
+- RoBERTa has the same architecture as BERT but uses a byte-level BPE as a tokenizer (same as GPT-2) and uses a
   different pretraining scheme.
-- RoBERTa doesn't have `token_type_ids`, you don't need to indicate which token belongs to which segment. Just
-  separate your segments with the separation token `tokenizer.sep_token` (or `</s>`)
+- RoBERTa doesn't have `token_type_ids`, so you don't need to indicate which token belongs to which segment. Just
+  separate your segments with the separation token `tokenizer.sep_token` (or `</s>`).
-- Same as BERT with better pretraining tricks:
-    * dynamic masking: tokens are masked differently at each epoch, whereas BERT does it once and for all
-    * together to reach 512 tokens (so the sentences are in an order than may span several documents)
-    * train with larger batches
-    * use BPE with bytes as a subunit and not characters (because of unicode characters)
+- RoBERTa is similar to BERT but with better pretraining techniques:
+    * Dynamic masking: tokens are masked differently at each epoch, whereas BERT does it once and for all.
+    * Sentence packing: Sentences are packed together to reach 512 tokens (so the sentences are in an order that may span several documents).
+    * Larger batches: Training uses larger batches.
+    * Byte-level BPE vocabulary: Uses BPE with bytes as a subunit instead of characters, accommodating Unicode characters.
-- [CamemBERT](camembert) is a wrapper around RoBERTa. Refer to this page for usage examples.
+- [CamemBERT](camembert) is a wrapper around RoBERTa. Refer to its model page for usage examples.
 ## Resources
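Though not part of the commit itself, the following is a minimal sketch of the tokenizer behavior the updated tips describe, assuming the standard `transformers` `AutoTokenizer` API and the public `FacebookAI/roberta-base` checkpoint:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("FacebookAI/roberta-base")

# Byte-level BPE (same family as GPT-2): arbitrary Unicode text encodes
# without falling back to an unknown token.
encoded = tokenizer("Héllo wörld")
print(list(encoded.keys()))  # ['input_ids', 'attention_mask'] -- no `token_type_ids`

# For segment pairs, the tokenizer inserts `tokenizer.sep_token` (`</s>`) itself,
# so there is no need to build segment indices by hand.
pair = tokenizer("First segment", "Second segment")
print(tokenizer.decode(pair["input_ids"]))
# roughly: <s>First segment</s></s>Second segment</s>
```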