chenpangpang / transformers
Commit aa0135f2 (Unverified)
Authored Jan 12, 2022 by Leandro von Werra; committed by GitHub, Jan 12, 2022
fix: switch from slow to generic tokenizer class (#15122)
Parent: 27b819b0
Showing 1 changed file with 2 additions and 2 deletions
--- a/examples/research_projects/codeparrot/scripts/bpe_training.py
+++ b/examples/research_projects/codeparrot/scripts/bpe_training.py
@@ -2,7 +2,7 @@ from datasets import load_dataset
 from tqdm import tqdm
 from arguments import TokenizerTrainingArguments
-from transformers import GPT2Tokenizer, HfArgumentParser
+from transformers import AutoTokenizer, HfArgumentParser
 from transformers.models.gpt2.tokenization_gpt2 import bytes_to_unicode
@@ -17,7 +17,7 @@ parser = HfArgumentParser(TokenizerTrainingArguments)
 args = parser.parse_args()
 # Base tokenizer
-tokenizer = GPT2Tokenizer.from_pretrained(args.base_tokenizer)
+tokenizer = AutoTokenizer.from_pretrained(args.base_tokenizer)
 base_vocab = list(bytes_to_unicode().values())
 # Load dataset
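To make the unchanged `base_vocab` context line concrete, the following is a minimal standalone sketch of what a GPT-2 style byte-to-unicode table computes. It is a reimplementation for illustration only, not the `bytes_to_unicode` function imported in the diff, and the name `bytes_to_unicode_sketch` is hypothetical. The assumption is that every one of the 256 byte values is mapped to a distinct printable character, so that the base vocabulary covers any input byte.

```python
def bytes_to_unicode_sketch() -> dict:
    """Map each of the 256 byte values to a distinct printable character.

    Illustrative reimplementation (hypothetical name), not the library function.
    """
    # Bytes that are already printable, non-space characters keep their own code point:
    # '!'..'~', U+00A1..U+00AC, and U+00AE..U+00FF.
    printable = (
        list(range(0x21, 0x7F))
        + list(range(0xA1, 0xAD))
        + list(range(0xAE, 0x100))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Remaining bytes (control characters, space, etc.) are shifted
            # to code points starting at 256 so they become visible characters.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


# Mirrors the shape of the context line in the diff: one character per byte value.
base_vocab = list(bytes_to_unicode_sketch().values())
```

Under this sketch, `base_vocab` holds 256 unique single-character strings; printable ASCII maps to itself, while the space byte maps to U+0120.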