Unverified Commit 69f57015 authored by Jabin Huang, committed by GitHub

Add default bos_token and eos_token for tokenizer of deberta_v2 (#12429)



* fix ids_to_tokens naming error in tokenizer of deberta v2

* Update tokenization_deberta_v2.py

Add bos_token and eos_token.

* format code

Co-authored-by: Jipeng Huang <jihuan@microsoft.com>
parent c9486fd0
@@ -60,6 +60,13 @@ class DebertaV2Tokenizer(PreTrainedTokenizer):
             contains the vocabulary necessary to instantiate a tokenizer.
         do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`False`):
             Whether or not to lowercase the input when tokenizing.
+        bos_token (:obj:`string`, `optional`, defaults to :obj:`"[CLS]"`):
+            The beginning of sequence token that was used during pre-training. Can be used as a sequence
+            classifier token. When building a sequence using special tokens, this is not the token that is
+            used for the beginning of sequence. The token used is the :obj:`cls_token`.
+        eos_token (:obj:`string`, `optional`, defaults to :obj:`"[SEP]"`):
+            The end of sequence token. When building a sequence using special tokens, this is not the token
+            that is used for the end of sequence. The token used is the :obj:`sep_token`.
         unk_token (:obj:`str`, `optional`, defaults to :obj:`"[UNK]"`):
             The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
             token instead.
@@ -101,6 +108,8 @@ class DebertaV2Tokenizer(PreTrainedTokenizer):
         vocab_file,
         do_lower_case=False,
         split_by_punct=False,
+        bos_token="[CLS]",
+        eos_token="[SEP]",
unk_token="[UNK]", unk_token="[UNK]",
sep_token="[SEP]", sep_token="[SEP]",
pad_token="[PAD]", pad_token="[PAD]",
...@@ -113,6 +122,8 @@ class DebertaV2Tokenizer(PreTrainedTokenizer): ...@@ -113,6 +122,8 @@ class DebertaV2Tokenizer(PreTrainedTokenizer):
super().__init__( super().__init__(
do_lower_case=do_lower_case, do_lower_case=do_lower_case,
bos_token=bos_token,
eos_token=eos_token,
             unk_token=unk_token,
             sep_token=sep_token,
             pad_token=pad_token,
...
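
With these defaults in place, a freshly loaded DeBERTa-v2 tokenizer exposes `bos_token` and `eos_token` without any extra configuration. A minimal sketch of the resulting behavior (the checkpoint name is illustrative; any DeBERTa-v2 checkpoint should behave the same):

```python
from transformers import DebertaV2Tokenizer

# Checkpoint name is illustrative; any DeBERTa-v2 checkpoint should work.
tokenizer = DebertaV2Tokenizer.from_pretrained("microsoft/deberta-v2-xlarge")

# The defaults added by this commit:
print(tokenizer.bos_token)  # [CLS]
print(tokenizer.eos_token)  # [SEP]

# As the docstring notes, sequences built with special tokens still use
# cls_token / sep_token rather than bos_token / eos_token:
ids = tokenizer.encode("hello world")  # adds special tokens by default
print(tokenizer.convert_ids_to_tokens(ids))  # ['[CLS]', ..., '[SEP]']
```

Since DeBERTa-v2 uses `[CLS]` and `[SEP]` in the same positions where other models use BOS/EOS, mapping `bos_token` to `[CLS]` and `eos_token` to `[SEP]` keeps code that reads these attributes generically from raising on this tokenizer.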