"git@developer.sourcefind.cn:chenpangpang/transformers.git" did not exist on "273617b86dbe5cd15afb795e994dffc44e09e2df"
Unverified Commit 9775b2eb authored by Catalin Voss, committed by GitHub

Allow tokenization of sequences > 512 for caching

For many applications that require randomized data access, it's easier to cache the tokenized representations than the words. So why not turn this error into a warning?
parent 2152bfea
```diff
@@ -232,7 +232,7 @@ class OpenAIGPTTokenizer(object):
             else:
                 ids.append(self.encoder.get(token, 0))
         if len(ids) > self.max_len:
-            raise ValueError(
+            logger.warning(
                 "Token indices sequence length is longer than the specified maximum "
                 " sequence length for this OpenAI GPT model ({} > {}). Running this"
                 " sequence through the model will result in indexing errors".format(len(ids), self.max_len)
```
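For context, here is a minimal sketch of the caching pattern the commit message describes: tokenize a long document once, convert it to ids in full (which, after this patch, only warns past `max_len` instead of raising), cache the ids, and slice model-sized windows at access time. The package and model names assume the `pytorch_pretrained_bert` API of this era of the repository; this is an illustration, not code from the commit.

```python
import random

# Assumes the pytorch_pretrained_bert package layout of this era.
from pytorch_pretrained_bert import OpenAIGPTTokenizer

tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")

# A document far longer than the model's 512-token limit.
long_text = "an example sentence repeated many times " * 300
tokens = tokenizer.tokenize(long_text)

# With this patch, converting more than max_len tokens logs a warning
# instead of raising ValueError, so the full id sequence can be cached.
ids = tokenizer.convert_tokens_to_ids(tokens)

# At training time, draw random max_len-sized windows from the cached ids;
# each window is short enough to run through the model safely.
start = random.randrange(len(ids) - tokenizer.max_len)
window = ids[start : start + tokenizer.max_len]
```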