[docstring] Fix bert generation tokenizer (#26820)

* Remove BertGenerationTokenizer from objects to ignore

  BertGenerationTokenizer is removed from the objects to ignore as a first step toward fixing its docstring.

* Docstrings fix for BertGenerationTokenizer

  The docstring fix for BertGenerationTokenizer is generated with check_docstrings.py.

* Fix docstring for BertGenerationTokenizer

  Added the sep_token type and description to the BertGenerationTokenizer docstring.

Commit 5c6b83cb authored Oct 16, 2023 by przemL, committed by GitHub Oct 16, 2023
Showing 2 changed files with 6 additions and 3 deletions

src/transformers/models/bert_generation/tokenization_bert_generation.py +6 -2

utils/check_docstrings.py +0 -1

--- a/src/transformers/models/bert_generation/tokenization_bert_generation.py
+++ b/src/transformers/models/bert_generation/tokenization_bert_generation.py
@@ -51,15 +51,19 @@ class BertGenerationTokenizer(PreTrainedTokenizer):
        vocab_file (`str`):
            [SentencePiece](https://github.com/google/sentencepiece) file (generally has a *.spm* extension) that
            contains the vocabulary necessary to instantiate a tokenizer.
-        eos_token (`str`, *optional*, defaults to `"</s>"`):
-            The end of sequence token.
        bos_token (`str`, *optional*, defaults to `"<s>"`):
            The begin of sequence token.
+        eos_token (`str`, *optional*, defaults to `"</s>"`):
+            The end of sequence token.
        unk_token (`str`, *optional*, defaults to `"<unk>"`):
            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
            token instead.
        pad_token (`str`, *optional*, defaults to `"<pad>"`):
            The token used for padding, for example when batching sequences of different lengths.
+        sep_token (`str`, *optional*, defaults to `"<::::>"`):
+            The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
+            sequence classification or for a text and a question for question answering. It is also used as the last
+            token of a sequence built with special tokens.
        sp_model_kwargs (`dict`, *optional*):
            Will be passed to the `SentencePieceProcessor.__init__()` method. The [Python wrapper for
            SentencePiece](https://github.com/google/sentencepiece/tree/master/python) can be used, among other things,

--- a/utils/check_docstrings.py
+++ b/utils/check_docstrings.py
@@ -94,7 +94,6 @@ OBJECTS_TO_IGNORE = [
    "BarthezTokenizerFast",
    "BeitModel",
    "BertConfig",
-    "BertGenerationTokenizer",
    "BertJapaneseTokenizer",
    "BertModel",
    "BertTokenizerFast",
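
The reordering of `eos_token` and `bos_token` and the new `sep_token` entry suggest that the docstring is expected to list arguments in the same order as the constructor signature. The following is a minimal sketch of that kind of check, not the logic of `utils/check_docstrings.py` itself: the helper names (`documented_args`, `signature_args`), the regular expression, and the toy constructor `toy_init` are all hypothetical and exist only for illustration.

```python
import inspect
import re

# Matches docstring argument headers such as:
#     eos_token (`str`, *optional*, defaults to `"</s>"`):
ARG_LINE = re.compile(r"^\s*(\w+) \(`[^`]+`.*\):\s*$")


def documented_args(docstring):
    """Return argument names in the order they appear in the docstring."""
    names = []
    for line in docstring.splitlines():
        match = ARG_LINE.match(line)
        if match:
            names.append(match.group(1))
    return names


def signature_args(func):
    """Return parameter names in signature order, skipping self, *args, **kwargs."""
    skip = (inspect.Parameter.VAR_POSITIONAL, inspect.Parameter.VAR_KEYWORD)
    return [
        name
        for name, param in inspect.signature(func).parameters.items()
        if name != "self" and param.kind not in skip
    ]


# Hypothetical constructor, assumed to mirror the argument order documented above.
def toy_init(self, vocab_file, bos_token="<s>", eos_token="</s>", sep_token="<::::>", **kwargs):
    pass


# Docstring shaped like the one before the fix: eos/bos swapped, sep_token missing.
OLD_DOC = """
    Args:
        vocab_file (`str`):
            Path to the vocabulary file.
        eos_token (`str`, *optional*, defaults to `"</s>"`):
            The end of sequence token.
        bos_token (`str`, *optional*, defaults to `"<s>"`):
            The begin of sequence token.
"""

# Docstring shaped like the one after the fix.
NEW_DOC = """
    Args:
        vocab_file (`str`):
            Path to the vocabulary file.
        bos_token (`str`, *optional*, defaults to `"<s>"`):
            The begin of sequence token.
        eos_token (`str`, *optional*, defaults to `"</s>"`):
            The end of sequence token.
        sep_token (`str`, *optional*, defaults to `"<::::>"`):
            The separator token.
"""

print(documented_args(OLD_DOC) == signature_args(toy_init))
print(documented_args(NEW_DOC) == signature_args(toy_init))
```

Under these assumptions, the old-style docstring fails the comparison because of both the swapped order and the missing `sep_token` entry, while the new-style docstring passes.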