    Preserve spaces in GPT-2 tokenizers (#2778) · f1e8a51f
    Joe Davison authored
    * Preserve spaces in GPT-2 tokenizers
    
    Preserves spaces after special tokens in GPT-2 and inherited (RoBERTa)
    tokenizers, enabling correct BPE encoding. Also automatically inserts a
    space in front of the first token in the encode function when adding
    special tokens.
    
    * Add tokenization preprocessing method
    
    * Add framework argument to pipeline factory
    
    Also fixes a pipeline test issue: each test input is now treated as a
    distinct sequence.
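
    The space-sensitivity behind the first change comes from GPT-2's
    byte-level BPE alphabet, in which the space byte is mapped to a visible
    marker character. A minimal sketch of that mapping (assumption: this
    mirrors the `bytes_to_unicode` helper from the GPT-2 reference
    implementation; it is not code from this commit):

    ```python
    def bytes_to_unicode():
        # Printable byte ranges are kept as-is; every other byte is shifted
        # into an unused Unicode range so no vocabulary symbol is invisible.
        bs = list(range(ord("!"), ord("~") + 1)) + \
             list(range(ord("\xa1"), ord("\xac") + 1)) + \
             list(range(ord("\xae"), ord("\xff") + 1))
        cs = bs[:]
        n = 0
        for b in range(256):
            if b not in bs:
                bs.append(b)
                cs.append(256 + n)
                n += 1
        return dict(zip(bs, map(chr, cs)))

    mapping = bytes_to_unicode()
    # The space byte (0x20) becomes the visible marker "Ġ", so " world" is
    # stored as "Ġworld". If the space after a special token is dropped
    # before encoding, BPE takes the spaceless merge path ("world") instead,
    # producing different token ids -- which is the bug this commit avoids.
    print(mapping[ord(" ")])  # → Ġ
    ```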