Commit 897d0841 authored by Yiqing-Zhou, committed by GitHub

read().splitlines() -> readlines()

splitlines() does not behave as expected here for bert-base-chinese because the vocab file contains a '\u2028' (Unicode line separator) token. str.splitlines() treats '\u2028' as a line boundary, so the token is lost: the value of '\u2028\n'.splitlines() is ['', ''].
Perhaps we should use readlines() instead, which does not split on '\u2028'.
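
The difference described in the commit message can be reproduced without the real vocab file. The snippet below is a minimal sketch: the three-entry vocabulary text is hypothetical, and io.StringIO stands in for the opened file.

```python
import io

# Hypothetical three-entry vocab; the middle token is U+2028 itself.
vocab_text = "a\n\u2028\nb\n"

# str.splitlines() treats U+2028 as a line boundary, so the token
# disappears and two empty strings show up in its place.
split_tokens = vocab_text.splitlines()

# readlines() on a text stream does not break on U+2028, so the token
# survives. Each entry keeps its trailing "\n", though.
read_tokens = io.StringIO(vocab_text).readlines()

print(split_tokens)  # ['a', '', '', 'b']
print(read_tokens)   # ['a\n', '\u2028\n', 'b\n']
```

With splitlines() the two empty strings collide on the same dictionary key, so every index after the '\u2028' entry would also be shifted by one.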
parent 2f869dc6
@@ -67,10 +67,9 @@ def load_vocab(vocab_file):
     """Loads a vocabulary file into a dictionary."""
     vocab = collections.OrderedDict()
     with open(vocab_file, "r", encoding="utf-8") as reader:
-        tokens = reader.read().splitlines()
+        tokens = reader.readlines()
     for index, token in enumerate(tokens):
         vocab[token] = index
-        index += 1
     return vocab
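
One caveat with the readlines() version: each returned line still carries its trailing "\n", so the dictionary keys would include the newline unless it is stripped. The following is a sketch of one way to handle that, not part of this commit; the rstrip("\n") call and the temporary vocab file are assumptions added for illustration.

```python
import collections
import os
import tempfile


def load_vocab(vocab_file):
    """Loads a vocabulary file into a dictionary (sketch with newline stripping)."""
    vocab = collections.OrderedDict()
    with open(vocab_file, "r", encoding="utf-8") as reader:
        tokens = reader.readlines()
    for index, token in enumerate(tokens):
        # Assumption, not in the commit: drop only the "\n" kept by readlines().
        # rstrip("\n") leaves a '\u2028' token untouched.
        vocab[token.rstrip("\n")] = index
    return vocab


# Hypothetical vocab file with a U+2028 token on its own line.
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".txt", delete=False
) as f:
    f.write("[PAD]\n\u2028\n中\n")
    path = f.name

vocab = load_vocab(path)
os.remove(path)
print(list(vocab.items()))
```

Stripping with an explicit "\n" argument rather than a bare rstrip() matters here, because a bare rstrip() removes all trailing whitespace and would turn whitespace-only tokens into empty strings.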