"git@developer.sourcefind.cn:chenpangpang/transformers.git" did not exist on "273617b86dbe5cd15afb795e994dffc44e09e2df"
Unverified Commit 9775b2eb authored by Catalin Voss, committed by GitHub

Allow tokenization of sequences > 512 for caching

For many applications that require randomized data access, it's easier to cache the tokenized representations than the words. So why not turn this error into a warning?
parent 2152bfea
```diff
@@ -232,7 +232,7 @@ class OpenAIGPTTokenizer(object):
             else:
                 ids.append(self.encoder.get(token, 0))
         if len(ids) > self.max_len:
-            raise ValueError(
+            logger.warning(
                 "Token indices sequence length is longer than the specified maximum "
                 " sequence length for this OpenAI GPT model ({} > {}). Running this"
                 " sequence through the model will result in indexing errors".format(len(ids), self.max_len)
```
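For context, here is a minimal sketch of the caching pattern the commit message describes: tokenize a long document once, convert it to ids in full (which, after this patch, only warns past `max_len` instead of raising), cache the ids, and slice model-sized windows at access time. The package and model names assume the `pytorch_pretrained_bert` API of this era of the repository; this is an illustration, not code from the commit.

```python
import random

# Assumes the pytorch_pretrained_bert package layout of this era.
from pytorch_pretrained_bert import OpenAIGPTTokenizer

tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")

# A document far longer than the model's 512-token limit.
long_text = "an example sentence repeated many times " * 300
tokens = tokenizer.tokenize(long_text)

# With this patch, converting more than max_len tokens logs a warning
# instead of raising ValueError, so the full id sequence can be cached.
ids = tokenizer.convert_tokens_to_ids(tokens)

# At training time, draw random max_len-sized windows from the cached ids;
# each window is short enough to run through the model safely.
start = random.randrange(len(ids) - tokenizer.max_len)
window = ids[start : start + tokenizer.max_len]
```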