Unverified commit 7feba744 authored by Arthur, committed by GitHub

[Tokenizer doc] Clarification about `add_prefix_space` (#24368)



* nits

* more details

* fixup

* Apply suggestions from code review
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

---------
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
parent 0527c1c0
@@ -143,8 +143,10 @@ class GenerationConfig(PushToHubMixin):
             If set to int > 0, all ngrams of that size can only occur once.
         bad_words_ids(`List[List[int]]`, *optional*):
             List of token ids that are not allowed to be generated. In order to get the token ids of the words that
-            should not appear in the generated text, use `tokenizer(bad_words, add_prefix_space=True,
-            add_special_tokens=False).input_ids`.
+            should not appear in the generated text, make sure to set `add_prefix_space=True` when initializing the
+            tokenizer, and use `tokenizer(bad_words, add_special_tokens=False).input_ids`. The `add_prefix_space`
+            argument is only supported for some slow tokenizers, as fast tokenizers' prefixing behaviours come from
+            `pre tokenizers`. Read more [here](https://huggingface.co/docs/tokenizers/api/pre-tokenizers).
         force_words_ids(`List[List[int]]` or `List[List[List[int]]]`, *optional*):
             List of token ids that must be generated. If given a `List[List[int]]`, this is treated as a simple list of
             words that must be included, the opposite to `bad_words_ids`. If given `List[List[List[int]]]`, this
...
@@ -546,8 +546,10 @@ class NoBadWordsLogitsProcessor(LogitsProcessor):
     Args:
         bad_words_ids (`List[List[int]]`):
             List of list of token ids that are not allowed to be generated. In order to get the token ids of the words
-            that should not appear in the generated text, use `tokenizer(bad_words, add_prefix_space=True,
-            add_special_tokens=False).input_ids`.
+            that should not appear in the generated text, make sure to set `add_prefix_space=True` when initializing
+            the tokenizer, and use `tokenizer(bad_words, add_special_tokens=False).input_ids`. The `add_prefix_space`
+            argument is only supported for some slow tokenizers, as fast tokenizers' prefixing behaviours come from
+            `pre tokenizers`. Read more [here](https://huggingface.co/docs/tokenizers/api/pre-tokenizers).
         eos_token_id (`Union[int, List[int]]`):
             The id of the *end-of-sequence* token. Optionally, use a list to set multiple *end-of-sequence* tokens.
     """
...
@@ -292,7 +292,10 @@ class TFNoBadWordsLogitsProcessor(TFLogitsProcessor):
     Args:
         bad_words_ids (`List[List[int]]`):
             List of list of token ids that are not allowed to be generated. In order to get the tokens of the words
-            that should not appear in the generated text, use `tokenizer(bad_word, add_prefix_space=True).input_ids`.
+            that should not appear in the generated text, make sure to set `add_prefix_space=True` when initializing
+            the tokenizer, and use `tokenizer(bad_words, add_special_tokens=False).input_ids`. The `add_prefix_space`
+            argument is only supported for some slow tokenizers, as fast tokenizers' prefixing behaviours come from
+            `pre tokenizers`. Read more [here](https://huggingface.co/docs/tokenizers/api/pre-tokenizers).
         eos_token_id (`int`):
             The id of the *end-of-sequence* token.
     """
...
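The distinction the clarified docstrings draw matters because BPE-style tokenizers assign different ids to a word at the start of a text and to the same word preceded by a space (mid-sentence occurrences carry a leading-space marker such as GPT-2's `Ġ`). A toy sketch of that behaviour (the vocabulary and ids below are invented for illustration, not any real tokenizer's):

```python
# Toy illustration of why `add_prefix_space=True` matters when collecting
# bad-word ids. The vocabulary here is invented; real BPE tokenizers
# (e.g. GPT-2's) apply the same idea via a leading-space marker like `Ġ`.
TOY_VOCAB = {
    "cat": 0,   # "cat" at the very start of a text
    " cat": 1,  # "cat" preceded by a space, i.e. mid-sentence
    "dog": 2,
    " dog": 3,
}

def toy_encode(word: str, add_prefix_space: bool = False) -> int:
    """Look up a single word, optionally prepending a space first."""
    if add_prefix_space:
        word = " " + word
    return TOY_VOCAB[word]

# Without the prefix space we get the start-of-text id, which will NOT
# match the word where it actually appears mid-generation:
print(toy_encode("cat"))                         # 0
# With the prefix space we get the id produced mid-sentence:
print(toy_encode("cat", add_prefix_space=True))  # 1
```

This is why the docstrings tell you to enable `add_prefix_space` before encoding the bad words: during generation the banned words almost always occur after a space, so only the space-prefixed ids will match.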