Unverified Commit 69f57015 authored by Jabin Huang, committed by GitHub

Add default bos_token and eos_token for tokenizer of deberta_v2 (#12429)



* fix ids_to_tokens naming error in tokenizer of deberta v2

* Update tokenization_deberta_v2.py

Add bos_token and eos_token.

* format code

Co-authored-by: Jipeng Huang <jihuan@microsoft.com>
parent c9486fd0
@@ -60,6 +60,13 @@ class DebertaV2Tokenizer(PreTrainedTokenizer):
             contains the vocabulary necessary to instantiate a tokenizer.
         do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`False`):
             Whether or not to lowercase the input when tokenizing.
+        bos_token (:obj:`string`, `optional`, defaults to :obj:`"[CLS]"`):
+            The beginning of sequence token that was used during pre-training. Can be used as a sequence
+            classifier token. When building a sequence using special tokens, this is not the token that is
+            used for the beginning of sequence. The token used is the :obj:`cls_token`.
+        eos_token (:obj:`string`, `optional`, defaults to :obj:`"[SEP]"`):
+            The end of sequence token. When building a sequence using special tokens, this is not the token
+            that is used for the end of sequence. The token used is the :obj:`sep_token`.
         unk_token (:obj:`str`, `optional`, defaults to :obj:`"[UNK]"`):
             The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
             token instead.
@@ -101,6 +108,8 @@ class DebertaV2Tokenizer(PreTrainedTokenizer):
         vocab_file,
         do_lower_case=False,
         split_by_punct=False,
+        bos_token="[CLS]",
+        eos_token="[SEP]",
unk_token="[UNK]", unk_token="[UNK]",
sep_token="[SEP]", sep_token="[SEP]",
pad_token="[PAD]", pad_token="[PAD]",
...@@ -113,6 +122,8 @@ class DebertaV2Tokenizer(PreTrainedTokenizer): ...@@ -113,6 +122,8 @@ class DebertaV2Tokenizer(PreTrainedTokenizer):
super().__init__( super().__init__(
do_lower_case=do_lower_case, do_lower_case=do_lower_case,
bos_token=bos_token,
eos_token=eos_token,
             unk_token=unk_token,
             sep_token=sep_token,
             pad_token=pad_token,
...
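
With these defaults in place, a freshly loaded DeBERTa-v2 tokenizer exposes `bos_token` and `eos_token` without any extra configuration. A minimal sketch of the resulting behavior (the checkpoint name is illustrative; any DeBERTa-v2 checkpoint should behave the same):

```python
from transformers import DebertaV2Tokenizer

# Checkpoint name is illustrative; any DeBERTa-v2 checkpoint should work.
tokenizer = DebertaV2Tokenizer.from_pretrained("microsoft/deberta-v2-xlarge")

# The defaults added by this commit:
print(tokenizer.bos_token)  # [CLS]
print(tokenizer.eos_token)  # [SEP]

# As the docstring notes, sequences built with special tokens still use
# cls_token / sep_token rather than bos_token / eos_token:
ids = tokenizer.encode("hello world")  # adds special tokens by default
print(tokenizer.convert_ids_to_tokens(ids))  # ['[CLS]', ..., '[SEP]']
```

Since DeBERTa-v2 uses `[CLS]` and `[SEP]` in the same positions where other models use BOS/EOS, mapping `bos_token` to `[CLS]` and `eos_token` to `[SEP]` keeps code that reads these attributes generically from raising on this tokenizer.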