chenpangpang / transformers
Commit 7516bcf2 (unverified)
Authored Aug 18, 2020 by Romain Rigaux
Committed by GitHub, Aug 18, 2020
[docs] Fix number of 'ug' occurrences in tokenizer_summary (#6574)
Parent: 5a5af22e
Showing 1 changed file with 1 addition and 1 deletion
docs/source/tokenizer_summary.rst
@@ -130,7 +130,7 @@ Then the base vocabulary is ['b', 'g', 'h', 'n', 'p', 's', 'u'] and all our word
 We then take each pair of symbols and look at the most frequent. For instance 'hu' is present `10 + 5 = 15` times (10
 times in the 10 occurrences of 'hug', 5 times in the 5 occurrences of 'hugs'). The most frequent here is 'ug', present
-`10 + 5 + 2 + 5 = 22` times in total. So the first merge rule the tokenizer learns is to group all 'u' and 'g' together
+`10 + 5 + 5 = 20` times in total. So the first merge rule the tokenizer learns is to group all 'u' and 'g' together
 then it adds 'ug' to the vocabulary. Our corpus then becomes
 ::
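The corrected total can be checked with a short pair-counting sketch. This is only an illustration, not the tokenizer's actual implementation: the hunk states only the counts for 'hug' (10) and 'hugs' (5), so the remaining words and counts in `corpus` are assumptions chosen to match the base vocabulary in the hunk header and the third `5` in the new sum. The helper name `count_pairs` is hypothetical.

```python
from collections import Counter

# Assumed word frequencies. Only 'hug' (10) and 'hugs' (5) appear in the hunk;
# 'pug', 'pun' and 'bun' and their counts are assumptions for illustration.
corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}


def count_pairs(word_freqs):
    """Count adjacent symbol pairs, weighting each pair by its word's frequency."""
    pairs = Counter()
    for word, freq in word_freqs.items():
        symbols = list(word)  # start from single-character symbols
        for left, right in zip(symbols, symbols[1:]):
            pairs[left + right] += freq
    return pairs


pair_counts = count_pairs(corpus)
best_pair, best_count = pair_counts.most_common(1)[0]

print("hu:", pair_counts["hu"])
print("ug:", pair_counts["ug"])
print("most frequent pair:", best_pair, best_count)
```

Under these assumed counts, 'ug' occurs in 'hug', 'pug' and 'hugs' only, which is why the sum has three terms rather than four.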