chenpangpang/transformers · Commits
"git@developer.sourcefind.cn:chenpangpang/transformers.git" did not exist on "e9c23fa056f401a586a1691edf773d1b9b60f96d"
Commit 99b9affa (unverified)
Authored Jan 29, 2021 by Ethan Chau; committed by GitHub on Jan 29, 2021
Clarify use of unk_token in tokenizer docstrings (#9875)
Parent: c2d0ffec
Showing 2 changed files with 1 addition and 11 deletions (+1 -11):

  src/transformers/tokenization_utils.py       +0 -3
  src/transformers/tokenization_utils_base.py  +1 -8
src/transformers/tokenization_utils.py

@@ -230,9 +230,6 @@ class PreTrainedTokenizer(PreTrainedTokenizerBase):
         """
         Converts a string in a sequence of tokens, using the tokenizer.
 
-        Note that, unlike Fast tokenizers (instances of PreTrainedTokenizerFast), this method won't replace the unknown
-        tokens with the `unk_token` yet (this is done in the `encode()` method)
-
         Split in words for word-based vocabulary or sub-words for sub-word-based vocabularies
         (BPE/SentencePieces/WordPieces). Takes care of added tokens.
...
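The note removed above claimed that slow tokenizers defer unk_token substitution to encode(); the commit clarifies that tokenize() itself already performs it. A minimal sketch of that behavior, assuming the bert-base-uncased checkpoint (not part of this commit; the exact split depends on the checkpoint's vocabulary):

from transformers import BertTokenizer

# Sketch only: "bert-base-uncased" is an example checkpoint; any slow
# WordPiece-based tokenizer should behave the same way.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# The emoji is not in the WordPiece vocabulary, so the slow tokenizer
# substitutes unk_token during tokenize() itself, not later in encode().
print(tokenizer.unk_token)             # '[UNK]'
print(tokenizer.tokenize("hello 😃"))  # e.g. ['hello', '[UNK]']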
src/transformers/tokenization_utils_base.py

@@ -2043,14 +2043,7 @@ class PreTrainedTokenizerBase(SpecialTokensMixin):
     def tokenize(self, text: str, pair: Optional[str] = None, add_special_tokens: bool = False, **kwargs) -> List[str]:
         """
-        Converts a string in a sequence of tokens, using the backend Rust tokenizer.
-
-        Note that this method behave differently between fast and slow tokenizers:
-
-            - in fast tokenizers (instances of :class:`~transformers.PreTrainedTokenizerFast`), this method will
-              replace the unknown tokens with the :obj:`unk_token`,
-            - in slow tokenizers (instances of :class:`~transformers.PreTrainedTokenizer`), this method keep unknown
-              tokens unchanged.
+        Converts a string in a sequence of tokens, replacing unknown tokens with the :obj:`unk_token`.
 
         Args:
             text (:obj:`str`):
...
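For the signature in the hunk above, a hypothetical usage sketch; the checkpoint name is an assumption, and any model that ships a fast tokenizer should behave similarly:

from transformers import AutoTokenizer

# Sketch only: load a fast tokenizer so that pair= and add_special_tokens=
# both take effect in tokenize().
tok = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

# pair= tokenizes a second sequence jointly with the first, and
# add_special_tokens=True also inserts markers such as [CLS] and [SEP].
print(tok.tokenize("How are you?", pair="Fine, thanks.", add_special_tokens=True))
# e.g. ['[CLS]', 'how', 'are', 'you', '?', '[SEP]', 'fine', ',', 'thanks', '.', '[SEP]']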