"vscode:/vscode.git/clone" did not exist on "039d8d65fc19ac74a8c7917233eb2828c46c0fa7"
  • Preserve spaces in GPT-2 tokenizers (#2778) · f1e8a51f
    Joe Davison authored
    * Preserve spaces in GPT-2 tokenizers
    
    Preserves spaces after special tokens in GPT-2 and inherited (RoBERTa)
    tokenizers, enabling correct BPE encoding. Automatically inserts a space
    in front of the first token in the encode function when adding special
    tokens.
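
    A minimal sketch of why the leading space matters, assuming a simplified,
    ASCII-only stand-in for the GPT-2 pre-tokenization pattern. The names
    PRETOKENIZE, pretokenize and add_prefix_space are hypothetical and are
    not the functions touched by this commit. GPT-2-style BPE attaches a
    preceding space to the word that follows it, so a word at the start of
    the input is split differently from the same word mid-sentence unless a
    space is inserted first:

```python
import re

# Simplified ASCII-only stand-in for the GPT-2 pre-tokenization pattern.
# Assumption: the real pattern also covers contractions and Unicode classes.
PRETOKENIZE = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+(?!\S)|\s+")


def pretokenize(text):
    """Split text into the chunks that BPE merges would then be applied to."""
    return PRETOKENIZE.findall(text)


def add_prefix_space(text):
    """Hypothetical helper: prepend a space unless the text already starts with whitespace."""
    if text and not text[0].isspace():
        return " " + text
    return text


# Without the prefix space, the first word lacks the leading-space marker
# that the same word carries when it appears mid-sentence.
print(pretokenize("Hello world"))
print(pretokenize(add_prefix_space("Hello world")))
```

    With the prefix space, every word in the input carries the same
    leading-space form, so the first word can map to the same BPE units as
    it would in the middle of a sentence.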
    
    * Add tokenization preprocessing method
    
    * Add framework argument to pipeline factory
    
    Also fixes a pipeline test issue. Each test input is now treated as a
    distinct sequence.
test_tokenization_roberta.py 5.68 KB