    Refactor to remove generate and fix some bad tokenization · 90e50b4c
    Leo Gao authored
    In particular, the following assumptions are FALSE in general (see the demonstration after the list):
    tokenize(context + continuation) = tokenize(context) + tokenize(continuation)
    len(tokenize(context + continuation)) = len(tokenize(context)) + len(tokenize(continuation))
    tokenize(context + continuation)[:len(tokenize(context))] = tokenize(context)
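    A minimal demonstration of all three failures, assuming the transformers library is available (the strings and tokenizer checkpoint here are illustrative, not from this commit):
    
        from transformers import GPT2TokenizerFast
    
        tok = GPT2TokenizerFast.from_pretrained("gpt2")
    
        context = "The answer is "   # trailing space is the culprit
        continuation = "yes"
    
        whole = tok.encode(context + continuation)
        ctx = tok.encode(context)
        cont = tok.encode(continuation)
    
        # Encoded together, the space merges into a single " yes" token;
        # encoded separately, it remains a standalone space token ("Ġ").
        print(tok.convert_ids_to_tokens(whole))       # ['The', 'Ġanswer', 'Ġis', 'Ġyes']
        print(tok.convert_ids_to_tokens(ctx + cont))  # ['The', 'Ġanswer', 'Ġis', 'Ġ', 'yes']
    
        assert whole != ctx + cont                 # assumption 1 fails
        assert len(whole) != len(ctx) + len(cont)  # assumption 2 fails
        assert whole[:len(ctx)] != ctx             # assumption 3 fails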
    
    So we need to tiptoe around the problem by being careful about how we perform the tokenization.
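    One careful way to encode a (context, continuation) pair, sketched below (this illustrates the general technique, not necessarily the exact code in this commit): shift any trailing whitespace from the context onto the continuation, encode the concatenation once, and recover the continuation tokens by slicing off the context's encoding. After the whitespace shift, GPT-2's byte-level BPE merges no longer straddle the boundary in practice, so the prefix slice is stable:
    
        def encode_pair(tok, context, continuation):
            # Move trailing spaces from context to continuation so no
            # BPE merge straddles the context/continuation boundary.
            n_spaces = len(context) - len(context.rstrip())
            if n_spaces > 0:
                continuation = context[-n_spaces:] + continuation
                context = context[:-n_spaces]
    
            whole_enc = tok.encode(context + continuation)
            context_enc = tok.encode(context)
            # Take the continuation tokens from the joint encoding, so the
            # pair always concatenates back to tokenize(context + continuation).
            continuation_enc = whole_enc[len(context_enc):]
            return context_enc, continuation_enc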
    
    In particular, using GPT2TokenizerFast is not just a performance choice: the behaviour of GPT2Tokenizer differs between Transformers 2 and 3, while GPT2TokenizerFast's does not.
base.py