chenpangpang / transformers
Commit fd32ebed (Unverified)
Authored Nov 20, 2018 by Thomas Wolf
Committed by GitHub, Nov 20, 2018
Merge pull request #42 from weiyumou/master
Fixed UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2
Parents: eed255a5, 9ff2b7d8
Showing 2 changed files with 2 additions and 2 deletions
README.md (+1 -1)
pytorch_pretrained_bert/tokenization.py (+1 -1)
README.md
@@ -99,7 +99,7 @@ from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM
 tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
 # Tokenized input
-tokenized_text = "Who was Jim Henson ? Jim Henson was a puppeteer"
+text = "Who was Jim Henson ? Jim Henson was a puppeteer"
 tokenized_text = tokenizer.tokenize(text)
 # Mask a token that we will try to predict back with `BertForMaskedLM`
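The README hunk reads as a variable-name fix: the old snippet assigned the sentence to `tokenized_text` and then passed `text` to the tokenizer. Assuming no earlier line in the README defines `text`, the old version would fail with a `NameError`. The sketch below reproduces that behavior with `whitespace_tokenize`, a hypothetical stand-in for `tokenizer.tokenize` so that it runs without downloading a model; the real BERT tokenizer produces different tokens.

```python
def whitespace_tokenize(text):
    """Hypothetical stand-in for tokenizer.tokenize: split on whitespace."""
    return text.split()


def old_readme_snippet():
    # Before the change: the sentence is bound to `tokenized_text`,
    # so `text` on the next line refers to a name that was never defined.
    tokenized_text = "Who was Jim Henson ? Jim Henson was a puppeteer"
    tokenized_text = whitespace_tokenize(text)
    return tokenized_text


def new_readme_snippet():
    # After the change: the sentence is bound to `text` first.
    text = "Who was Jim Henson ? Jim Henson was a puppeteer"
    tokenized_text = whitespace_tokenize(text)
    return tokenized_text


try:
    old_readme_snippet()
    old_error = None
except NameError as exc:
    old_error = exc

tokens = new_readme_snippet()
print(type(old_error).__name__)
print(tokens)
```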
pytorch_pretrained_bert/tokenization.py
@@ -65,7 +65,7 @@ def load_vocab(vocab_file):
     """Loads a vocabulary file into a dictionary."""
     vocab = collections.OrderedDict()
     index = 0
-    with open(vocab_file, "r") as reader:
+    with open(vocab_file, "r", encoding="utf8") as reader:
         while True:
             token = convert_to_unicode(reader.readline())
             if not token:
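The change to `load_vocab` matters because, without an explicit `encoding`, Python 3's `open()` in text mode falls back to a platform-dependent default. On a system where that default is ASCII, the first non-ASCII byte in the vocabulary file raises the `UnicodeDecodeError` named in the commit title. The following is a minimal sketch of that failure and the fix, not the library's code: `load_vocab_sketch` is a simplified, hypothetical stand-in for `load_vocab`, the three-line vocabulary is made up for illustration, and passing `encoding="ascii"` only simulates an ASCII default.

```python
import collections
import os
import tempfile


def load_vocab_sketch(vocab_file, encoding="utf8"):
    """Simplified stand-in: map each line's token to its line index."""
    vocab = collections.OrderedDict()
    with open(vocab_file, "r", encoding=encoding) as reader:
        for index, line in enumerate(reader):
            vocab[line.strip()] = index
    return vocab


# Write a tiny made-up vocabulary containing a non-ASCII token.
# In UTF-8, "£" is encoded as the two bytes 0xC2 0xA3.
with tempfile.NamedTemporaryFile(
    "w", encoding="utf8", suffix=".txt", delete=False
) as tmp:
    tmp.write("hello\n£\nworld\n")
    vocab_path = tmp.name

try:
    # Forcing ASCII simulates a platform whose default text encoding is ASCII.
    try:
        load_vocab_sketch(vocab_path, encoding="ascii")
        ascii_error = None
    except UnicodeDecodeError as exc:
        ascii_error = exc

    # With the encoding pinned to UTF-8, the same file loads on any platform.
    vocab = load_vocab_sketch(vocab_path)
finally:
    os.remove(vocab_path)

print(ascii_error)
print(list(vocab.items()))
```

Pinning the encoding at the `open()` call keeps the result independent of locale settings, which is the effect of the one-line change in the hunk above.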