Unverified Commit 21f28c34 authored by Sylvain Gugger, committed by GitHub

Fix #5507 (#5559)

* Fix #5507

* Fix formatting
parent 9d9b872b
@@ -102,17 +102,26 @@ def get_pairs(word):
 class GPT2Tokenizer(PreTrainedTokenizer):
     """
-    GPT-2 BPE tokenizer. Peculiarities:
+    GPT-2 BPE tokenizer, using byte-level Byte-Pair-Encoding.
 
-    - Byte-level Byte-Pair-Encoding
-    - Requires a space to start the input string => the encoding methods should be called with the
-      ``add_prefix_space`` flag set to ``True``.
-      Otherwise, this tokenizer ``encode`` and ``decode`` method will not conserve
-      the absence of a space at the beginning of a string:
+    This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece) so a word will
+    be encoded differently whether it is at the beginning of the sentence (without space) or not:
 
     ::
 
-        tokenizer.decode(tokenizer.encode("Hello")) = " Hello"
+        >>> from transformers import GPT2Tokenizer
+        >>> tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
+        >>> tokenizer("Hello world")['input_ids']
+        [15496, 995]
+        >>> tokenizer(" Hello world")['input_ids']
+        [18435, 995]
+
+    You can get around that behavior by passing ``add_prefix_space=True`` when instantiating this tokenizer or when you
+    call it on some text, but since the model was not pretrained this way, it might yield a decrease in performance.
+
+    .. note::
+
+        When used with ``is_pretokenized=True``, this tokenizer will add a space before each word (even the first one).
 
     This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the methods. Users
     should refer to the superclass for more information regarding methods.
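For context, the behavior the new docstring documents can be exercised directly. Below is a minimal sketch, assuming a transformers release from around this commit (the 3.x series), where ``add_prefix_space`` is accepted both when instantiating the tokenizer and when calling it::

    from transformers import GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

    # Sentence-initial "Hello" (no leading space) gets its own encoding.
    print(tokenizer("Hello world")["input_ids"])  # [15496, 995]

    # add_prefix_space=True at call time prepends a space before encoding,
    # so the result matches encoding " Hello world" directly.
    print(tokenizer("Hello world", add_prefix_space=True)["input_ids"])  # [18435, 995]

    # The flag can also be fixed once at instantiation.
    tokenizer_ps = GPT2Tokenizer.from_pretrained("gpt2", add_prefix_space=True)
    print(tokenizer_ps("Hello world")["input_ids"])  # [18435, 995]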
@@ -62,17 +62,26 @@ PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
 class RobertaTokenizer(GPT2Tokenizer):
     """
-    Constructs a RoBERTa BPE tokenizer, derived from the GPT-2 tokenizer. Peculiarities:
+    Constructs a RoBERTa BPE tokenizer, derived from the GPT-2 tokenizer, using byte-level Byte-Pair-Encoding.
 
-    - Byte-level Byte-Pair-Encoding
-    - Requires a space to start the input string => the encoding methods should be called with the
-      ``add_prefix_space`` flag set to ``True``.
-      Otherwise, this tokenizer ``encode`` and ``decode`` method will not conserve
-      the absence of a space at the beginning of a string:
+    This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece) so a word will
+    be encoded differently whether it is at the beginning of the sentence (without space) or not:
 
     ::
 
-        tokenizer.decode(tokenizer.encode("Hello")) = " Hello"
+        >>> from transformers import RobertaTokenizer
+        >>> tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
+        >>> tokenizer("Hello world")['input_ids']
+        [0, 31414, 232, 328, 2]
+        >>> tokenizer(" Hello world")['input_ids']
+        [0, 20920, 232, 2]
+
+    You can get around that behavior by passing ``add_prefix_space=True`` when instantiating this tokenizer or when you
+    call it on some text, but since the model was not pretrained this way, it might yield a decrease in performance.
+
+    .. note::
+
+        When used with ``is_pretokenized=True``, this tokenizer will add a space before each word (even the first one).
 
     This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the methods. Users
     should refer to the superclass for more information regarding methods.
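The ``.. note::`` about pre-tokenized input can be checked the same way. A minimal sketch, again assuming the 3.x-era API (``is_pretokenized`` was renamed ``is_split_into_words`` in later releases): with ``is_pretokenized=True`` a flat list of strings is treated as one already-split sequence, and a space is added before every word, including the first::

    from transformers import RobertaTokenizer

    tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

    # Every word gets a space prepended, so each is encoded in its
    # space-prefixed ("mid-sentence") form.
    print(tokenizer(["Hello", "world"], is_pretokenized=True)["input_ids"])
    # Expected to match the " Hello world" encoding shown above: [0, 20920, 232, 2]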