chenpangpang / transformers · Commits · 1a113fcf

Commit 1a113fcf authored Jun 15, 2023 by Belladore, committed by GitHub on Jun 15, 2023

Update tokenizer_summary.mdx (grammar) (#24286)
parent c3ca346b
Showing 1 changed file with 1 addition and 1 deletion.
docs/source/en/tokenizer_summary.mdx
@@ -141,7 +141,7 @@ words. Pretokenization can be as simple as space tokenization, e.g. [GPT-2](mode
 [FlauBERT](model_doc/flaubert) which uses Moses for most languages, or [GPT](model_doc/gpt) which uses
 Spacy and ftfy, to count the frequency of each word in the training corpus.
-After pre-tokenization, a set of unique words has been created and the frequency of each word it occurred in the
+After pre-tokenization, a set of unique words has been created and the frequency with which each word occurred in the
 training data has been determined. Next, BPE creates a base vocabulary consisting of all symbols that occur in the set
 of unique words and learns merge rules to form a new symbol from two symbols of the base vocabulary. It does so until
 the vocabulary has attained the desired vocabulary size. Note that the desired vocabulary size is a hyperparameter to
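For context, the paragraph being corrected describes the BPE training loop: build a base vocabulary from all symbols in the unique words, then repeatedly merge the most frequent adjacent symbol pair into a new symbol until the vocabulary reaches the desired size. A minimal sketch of that loop, assuming simple whitespace pre-tokenization; the function name `learn_bpe_merges` and its parameters are illustrative, not the transformers or tokenizers API:

```python
from collections import Counter

def learn_bpe_merges(corpus: str, target_vocab_size: int):
    # Pre-tokenization: find the unique words and the frequency with which
    # each word occurred in the training data (whitespace split for brevity).
    word_freqs = Counter(corpus.split())
    # Represent each unique word as a sequence of base symbols (characters).
    words = {word: tuple(word) for word in word_freqs}
    # Base vocabulary: all symbols that occur in the set of unique words.
    vocab = {sym for syms in words.values() for sym in syms}
    merges = []
    while len(vocab) < target_vocab_size:
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_freqs = Counter()
        for word, syms in words.items():
            for pair in zip(syms, syms[1:]):
                pair_freqs[pair] += word_freqs[word]
        if not pair_freqs:
            break  # nothing left to merge
        # Learn a merge rule: fuse the most frequent pair into a new symbol.
        (a, b), _ = pair_freqs.most_common(1)[0]
        merges.append((a, b))
        vocab.add(a + b)
        # Apply the new rule to every word.
        for word, syms in words.items():
            merged, i = [], 0
            while i < len(syms):
                if i + 1 < len(syms) and (syms[i], syms[i + 1]) == (a, b):
                    merged.append(a + b)
                    i += 2
                else:
                    merged.append(syms[i])
                    i += 1
            words[word] = tuple(merged)
    return vocab, merges

# Example: "u" + "g" is the first merge because "ug" is the most frequent
# adjacent pair, after which "h" + "ug" merges to form "hug".
vocab, merges = learn_bpe_merges("hug hug hug pug pun bun hug", target_vocab_size=12)
```

The desired vocabulary size mentioned in the hunk corresponds to `target_vocab_size` here: the loop stops merging once the vocabulary reaches that size, which is why it is a hyperparameter chosen before training.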