Commit 5c6b83cb, authored by przemL, committed by GitHub

[docstring] Fix bert generation tokenizer (#26820)

* Remove BertGenerationTokenizer from objects to ignore

BertGenerationTokenizer is removed from OBJECTS_TO_IGNORE
as a first step toward fixing its docstring.

* Docstrings fix for BertGenerationTokenizer

The docstring fix for BertGenerationTokenizer is generated
with check_docstrings.py.

* Fix docstring for BertGenerationTokenizer

Added the sep_token type and description to the BertGenerationTokenizer docstring.
parent 12cc1233
@@ -51,15 +51,19 @@ class BertGenerationTokenizer(PreTrainedTokenizer):
 vocab_file (`str`):
     [SentencePiece](https://github.com/google/sentencepiece) file (generally has a *.spm* extension) that
     contains the vocabulary necessary to instantiate a tokenizer.
-eos_token (`str`, *optional*, defaults to `"</s>"`):
-    The end of sequence token.
 bos_token (`str`, *optional*, defaults to `"<s>"`):
     The begin of sequence token.
+eos_token (`str`, *optional*, defaults to `"</s>"`):
+    The end of sequence token.
 unk_token (`str`, *optional*, defaults to `"<unk>"`):
     The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
     token instead.
 pad_token (`str`, *optional*, defaults to `"<pad>"`):
     The token used for padding, for example when batching sequences of different lengths.
+sep_token (`str`, *optional*, defaults to `"<::::>"`):
+    The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
+    sequence classification or for a text and a question for question answering. It is also used as the last
+    token of a sequence built with special tokens.
 sp_model_kwargs (`dict`, *optional*):
     Will be passed to the `SentencePieceProcessor.__init__()` method. The [Python wrapper for
     SentencePiece](https://github.com/google/sentencepiece/tree/master/python) can be used, among other things,
......
@@ -94,7 +94,6 @@ OBJECTS_TO_IGNORE = [
 "BarthezTokenizerFast",
 "BeitModel",
 "BertConfig",
-"BertGenerationTokenizer",
 "BertJapaneseTokenizer",
 "BertModel",
 "BertTokenizerFast",
......
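The two hunks fit together: once a class leaves OBJECTS_TO_IGNORE, its docstring presumably has to agree with its constructor, which would explain both the reordered `bos_token`/`eos_token` entries and the new `sep_token` entry. The sketch below illustrates that kind of check using only the standard library. It is not the code of check_docstrings.py. `ToyTokenizer`, `OldToyTokenizer`, `documented_args`, `signature_args` and `docstring_issues` are hypothetical names, and the constructor argument order is an assumption, since the diff does not show the signature.

```python
import inspect
import re

# Matches an argument header such as:
#   bos_token (`str`, *optional*, defaults to `"<s>"`):
ARG_LINE = re.compile(
    r"^\s*(\w+) \(`[^`]+`(?:, \*optional\*)?(?:, defaults to `(.+)`)?\):\s*$"
)


def documented_args(doc):
    """Return [(name, default_text_or_None), ...] in docstring order."""
    found = []
    for line in doc.splitlines():
        match = ARG_LINE.match(line)
        if match:
            found.append((match.group(1), match.group(2)))
    return found


def signature_args(cls):
    """Return [(name, default_text_or_None), ...] in signature order."""
    params = list(inspect.signature(cls.__init__).parameters.values())[1:]  # drop self
    result = []
    for param in params:
        if param.default is inspect.Parameter.empty:
            result.append((param.name, None))
        elif isinstance(param.default, str):
            result.append((param.name, f'"{param.default}"'))  # docstring quoting style
        else:
            result.append((param.name, repr(param.default)))
    return result


def docstring_issues(cls, ignore=()):
    """List disagreements between a class docstring and its __init__ signature."""
    if cls.__name__ in ignore:  # ignored objects are skipped entirely
        return []
    documented = documented_args(inspect.getdoc(cls) or "")
    expected = signature_args(cls)
    doc_names = [name for name, _ in documented]
    sig_names = [name for name, _ in expected]
    issues = [f"missing: {name}" for name in sig_names if name not in doc_names]
    shared_doc = [name for name in doc_names if name in sig_names]
    shared_sig = [name for name in sig_names if name in doc_names]
    if shared_doc != shared_sig:
        issues.append("order differs from signature")
    doc_defaults = dict(documented)
    for name, default in expected:
        if name in doc_defaults and doc_defaults[name] != default:
            issues.append(f"default mismatch: {name}")
    return issues


class ToyTokenizer:
    """
    Args:
        vocab_file (`str`):
            Vocabulary file.
        bos_token (`str`, *optional*, defaults to `"<s>"`):
            The begin of sequence token.
        eos_token (`str`, *optional*, defaults to `"</s>"`):
            The end of sequence token.
        sep_token (`str`, *optional*, defaults to `"<::::>"`):
            The separator token.
    """

    # Assumed argument order; the real signature is not part of the diff.
    def __init__(self, vocab_file, bos_token="<s>", eos_token="</s>", sep_token="<::::>"):
        self.vocab_file = vocab_file


class OldToyTokenizer(ToyTokenizer):
    """
    Args:
        vocab_file (`str`):
            Vocabulary file.
        eos_token (`str`, *optional*, defaults to `"</s>"`):
            The end of sequence token.
        bos_token (`str`, *optional*, defaults to `"<s>"`):
            The begin of sequence token.
    """

    def __init__(self, vocab_file, bos_token="<s>", eos_token="</s>", sep_token="<::::>"):
        super().__init__(vocab_file, bos_token, eos_token, sep_token)


print(docstring_issues(ToyTokenizer))
print(docstring_issues(OldToyTokenizer))
print(docstring_issues(OldToyTokenizer, ignore=["OldToyTokenizer"]))
```

`OldToyTokenizer` mirrors the pre-commit docstring (swapped `eos_token`/`bos_token`, no `sep_token`), so it is reported unless its name is in the ignore list, while `ToyTokenizer` mirrors the post-commit docstring and passes.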