read().splitlines() -> readlines()

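splitlines() does not behave as expected here for bert-base-chinese because the vocab file contains a '\u2028' (Unicode line separator) token, which splitlines() treats as a line boundary. For that line of the file, '\u2028\n'.splitlines() returns ['', ''], so the token is lost and an extra empty entry appears. Perhaps we should use readlines() instead.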
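
The difference can be reproduced without the real vocab file. The following is a minimal sketch that assumes a made-up three-token vocabulary and uses io.StringIO as a stand-in for the opened file; the token names are illustrative only.

```python
import io

# Made-up vocab contents: three tokens, one per line.
# The middle token is U+2028 (LINE SEPARATOR) itself.
vocab_text = "foo\n\u2028\nbar\n"

# str.splitlines() treats U+2028 as a line boundary, so the token
# disappears and two empty strings take its place.
split_tokens = vocab_text.splitlines()

# readlines() on a text stream splits on newline characters only,
# so the U+2028 token survives (with its trailing "\n" still attached).
read_tokens = io.StringIO(vocab_text).readlines()

assert split_tokens == ["foo", "", "", "bar"]
assert read_tokens == ["foo\n", "\u2028\n", "bar\n"]
```

With splitlines(), "bar" ends up at index 3 instead of 2, and the two empty strings collide on the same dictionary key.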

Commit 897d0841 (unverified) · authored Jul 22, 2019 by Yiqing-Zhou · committed by GitHub Jul 22, 2019

Showing 1 changed file with 1 addition and 2 deletions

pytorch_transformers/tokenization_bert.py +1 -2

--- a/pytorch_transformers/tokenization_bert.py
+++ b/pytorch_transformers/tokenization_bert.py
@@ -67,10 +67,9 @@ def load_vocab(vocab_file):
    """Loads a vocabulary file into a dictionary."""
    vocab = collections.OrderedDict()
    with open(vocab_file, "r", encoding="utf-8") as reader:
-        tokens = reader.read().splitlines()
+        tokens = reader.readlines()
    for index, token in enumerate(tokens):
        vocab[token] = index
-        index += 1
    return vocab
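
One consequence of switching to readlines() is that each line keeps its trailing newline, so the dictionary keys would end in "\n" unless it is stripped. The sketch below is not part of this commit: load_vocab_stripped is a hypothetical name, and the temporary file is a tiny stand-in for a real vocab file. It strips only "\n", on the assumption that a bare strip() is unsuitable here because U+2028 counts as whitespace and would be stripped too.

```python
import collections
import os
import tempfile


def load_vocab_stripped(vocab_file):
    """Hypothetical variant of load_vocab that drops the trailing newline."""
    vocab = collections.OrderedDict()
    with open(vocab_file, "r", encoding="utf-8") as reader:
        tokens = reader.readlines()
    for index, token in enumerate(tokens):
        # Remove only "\n"; strip() with no arguments would also
        # remove U+2028 and turn that token into an empty string.
        vocab[token.rstrip("\n")] = index
    return vocab


# Tiny stand-in vocab file (not the real bert-base-chinese vocab).
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".txt", delete=False
) as f:
    f.write("[PAD]\n\u2028\n[UNK]\n")
    path = f.name

vocab = load_vocab_stripped(path)
os.remove(path)

assert list(vocab.items()) == [("[PAD]", 0), ("\u2028", 1), ("[UNK]", 2)]
```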