    Refactor to remove generate and fix some bad tokenization · 90e50b4c
    Leo Gao authored
    In particular, the following assumptions are FALSE in general (see the demonstration after the list):
    tokenize(context + continuation) = tokenize(context) + tokenize(continuation)
    len(tokenize(context + continuation)) = len(tokenize(context)) + len(tokenize(continuation))
    tokenize(context + continuation)[:len(tokenize(context))] = tokenize(context)
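    A minimal demonstration of all three failures, assuming the transformers library is available (the strings and tokenizer checkpoint here are illustrative, not from this commit):
    
        from transformers import GPT2TokenizerFast
    
        tok = GPT2TokenizerFast.from_pretrained("gpt2")
    
        context = "The answer is "   # trailing space is the culprit
        continuation = "yes"
    
        whole = tok.encode(context + continuation)
        ctx = tok.encode(context)
        cont = tok.encode(continuation)
    
        # Encoded together, the space merges into a single " yes" token;
        # encoded separately, it remains a standalone space token ("Ġ").
        print(tok.convert_ids_to_tokens(whole))       # ['The', 'Ġanswer', 'Ġis', 'Ġyes']
        print(tok.convert_ids_to_tokens(ctx + cont))  # ['The', 'Ġanswer', 'Ġis', 'Ġ', 'yes']
    
        assert whole != ctx + cont                 # assumption 1 fails
        assert len(whole) != len(ctx) + len(cont)  # assumption 2 fails
        assert whole[:len(ctx)] != ctx             # assumption 3 fails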
    
    So we need to tiptoe around the problem by being careful about how we perform the tokenization.
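    One careful way to encode a (context, continuation) pair, sketched below (this illustrates the general technique, not necessarily the exact code in this commit): shift any trailing whitespace from the context onto the continuation, encode the concatenation once, and recover the continuation tokens by slicing off the context's encoding. After the whitespace shift, GPT-2's byte-level BPE merges no longer straddle the boundary in practice, so the prefix slice is stable:
    
        def encode_pair(tok, context, continuation):
            # Move trailing spaces from context to continuation so no
            # BPE merge straddles the context/continuation boundary.
            n_spaces = len(context) - len(context.rstrip())
            if n_spaces > 0:
                continuation = context[-n_spaces:] + continuation
                context = context[:-n_spaces]
    
            whole_enc = tok.encode(context + continuation)
            context_enc = tok.encode(context)
            # Take the continuation tokens from the joint encoding, so the
            # pair always concatenates back to tokenize(context + continuation).
            continuation_enc = whole_enc[len(context_enc):]
            return context_enc, continuation_enc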
    
    In particular, using GPT2TokenizerFast is not just a performance choice: the behaviour of GPT2Tokenizer differs between Transformers 2 and 3, while GPT2TokenizerFast's does not.
base.py