chenpangpang / transformers · Commits

Unverified commit 18ca0e91, authored Aug 19, 2020 by Sylvain Gugger, committed by GitHub on Aug 19, 2020
Fix #6575 (#6596)
Parent: 7581884d
Showing 2 changed files with 8 additions and 1 deletion.
docs/source/preprocessing.rst                 +6 -0
src/transformers/tokenization_utils_base.py   +2 -1
docs/source/preprocessing.rst
@@ -284,6 +284,12 @@ The tokenizer also accept pre-tokenized inputs. This is particularly useful when
 predictions in `named entity recognition (NER) <https://en.wikipedia.org/wiki/Named-entity_recognition>`__ or
 `part-of-speech tagging (POS tagging) <https://en.wikipedia.org/wiki/Part-of-speech_tagging>`__.

+.. warning::
+
+    Pre-tokenized does not mean your inputs are already tokenized (you wouldn't need to pass them through the tokenizer
+    if that was the case) but just split into words (which is often the first step in subword tokenization algorithms
+    like BPE).
+
 If you want to use pre-tokenized inputs, just set :obj:`is_pretokenized=True` when passing your inputs to the
 tokenizer. For instance, we have:
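For illustration (not part of the commit), here is a minimal sketch of the usage the docs describe, assuming the tokenizer `__call__` API as it existed at this commit; the "bert-base-cased" checkpoint and the example sentence are arbitrary choices:

# Minimal sketch: passing word-split input with is_pretokenized=True, as described above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # illustrative checkpoint

# "Pre-tokenized" here means split into words, not already turned into subword tokens.
words = ["Hello", "I'm", "a", "single", "sentence"]

encoding = tokenizer(words, is_pretokenized=True)
print(encoding["input_ids"])  # subword ids, with the model's special tokens added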
src/transformers/tokenization_utils_base.py
@@ -1088,7 +1088,8 @@ ENCODE_KWARGS_DOCSTRING = r"""
                 returned to provide some overlap between truncated and overflowing sequences. The value of this
                 argument defines the number of overlapping tokens.
             is_pretokenized (:obj:`bool`, `optional`, defaults to :obj:`False`):
-                Whether or not the input is already tokenized.
+                Whether or not the input is already pre-tokenized (e.g., split into words), in which case the tokenizer
+                will skip the pre-tokenization step. This is useful for NER or token classification.
             pad_to_multiple_of (:obj:`int`, `optional`):
                 If set will pad the sequence to a multiple of the provided value. This is especially useful to enable
                 the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta).
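For illustration (not part of the commit), a sketch combining the two arguments documented above: is_pretokenized for word-split input and pad_to_multiple_of for hardware-friendly padding. The checkpoint and the batch contents are arbitrary.

# Minimal sketch: a batch of word-split inputs, padded to a multiple of 8.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # illustrative checkpoint

batch = [["New", "York", "is", "big"], ["Paris", "too"]]
encoding = tokenizer(batch, is_pretokenized=True, padding=True, pad_to_multiple_of=8)

# Every padded sequence length is rounded up to a multiple of 8, which helps
# Tensor Core utilization on recent NVIDIA GPUs.
print([len(ids) for ids in encoding["input_ids"]])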