chenpangpang / transformers
Commit 7516bcf2
Authored Aug 18, 2020 by Romain Rigaux
Committed by GitHub, Aug 18, 2020
[docs] Fix number of 'ug' occurrences in tokenizer_summary (#6574)
Parent: 5a5af22e
Showing 1 changed file with 1 addition and 1 deletion
docs/source/tokenizer_summary.rst
...
...
@@ -130,7 +130,7 @@ Then the base vocabulary is ['b', 'g', 'h', 'n', 'p', 's', 'u'] and all our word
We then take each pair of symbols and look at the most frequent. For instance 'hu' is present `10 + 5 = 15` times (10
times in the 10 occurrences of 'hug', 5 times in the 5 occurrences of 'hugs'). The most frequent here is 'ug', present
-`10 + 5 + 2 + 5 = 22` times in total. So the first merge rule the tokenizer learns is to group all 'u' and 'g' together
+`10 + 5 + 5 = 20` times in total. So the first merge rule the tokenizer learns is to group all 'u' and 'g' together
then it adds 'ug' to the vocabulary. Our corpus then becomes
::
...
...
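The corrected count in this hunk can be checked with a few lines of Python. This is only a minimal sketch of the pair-counting step, not the tokenizer's actual implementation. The hunk itself shows just 'hug' (10 occurrences) and 'hugs' (5 occurrences); the remaining entries in `word_freqs` are assumed from the surrounding document and are chosen to match the base vocabulary ['b', 'g', 'h', 'n', 'p', 's', 'u']. The names `word_freqs` and `count_pairs` are hypothetical.

```python
from collections import Counter

# Word frequencies. Only 'hug' (10) and 'hugs' (5) are visible in the hunk;
# 'pug', 'pun' and 'bun' with these counts are assumptions for illustration.
word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}


def count_pairs(word_freqs):
    """Count each adjacent symbol pair, weighted by the word's frequency."""
    pairs = Counter()
    for word, freq in word_freqs.items():
        symbols = list(word)  # start from single characters
        for left, right in zip(symbols, symbols[1:]):
            pairs[left + right] += freq
    return pairs


pair_counts = count_pairs(word_freqs)
print(pair_counts["hu"])           # 10 + 5 = 15
print(pair_counts["ug"])           # 10 + 5 + 5 = 20
print(pair_counts.most_common(1))  # [('ug', 20)]
```

Under these assumed frequencies, 'ug' occurs in 'hug', 'pug' and 'hugs' only, which gives 20 rather than 22 and agrees with the added line of the diff.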