    Tokenizers API developments (#5103) · 11fdde02
    Thomas Wolf authored
    
    
    * Add return lengths
    
    * make pad a bit more flexible so it can be used as collate_fn
    
    * check all kwargs sent to encoding method are known
    
    * fixing kwargs in encodings
    
    * New AddedToken class in python
    
    This class lets you specify specific tokenization behaviors for some special tokens. It is used in particular for GPT2 and Roberta, to control how whitespace is stripped around special tokens.
    
    * style and quality
    
    * switched to huggingface tokenizers library for AddedTokens
    
    * up to tokenizer 0.8.0-rc3 - update API to use AddedToken state
    
    * style and quality
    
    * do not raise an error on additional or unused kwargs for tokenize() but only a warning
    
    * transfo-xl pretrained model requires torch
    
    * Update src/transformers/tokenization_utils.py
    Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
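
The AddedToken entry above describes controlling how whitespace is stripped around special tokens. As a rough illustration of that idea only, here is a minimal stdlib sketch. `AddedTokenSketch` and `split_on_token` are hypothetical names invented for this example, not the class added in this commit, and the `lstrip`/`rstrip` flag names are an assumption about how such behavior might be expressed.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AddedTokenSketch:
    """Hypothetical stand-in for an added-token description (not the library class)."""

    content: str
    lstrip: bool = False  # assumption: absorb whitespace to the left of the token
    rstrip: bool = False  # assumption: absorb whitespace to the right of the token


def split_on_token(text: str, token: AddedTokenSketch) -> List[str]:
    """Split text around each occurrence of token.content, applying the strip flags."""
    pieces = text.split(token.content)
    out: List[str] = []
    for i, piece in enumerate(pieces):
        if i > 0:
            # Every piece after the first is preceded by one occurrence of the token.
            out.append(token.content)
            if token.rstrip:
                piece = piece.lstrip()  # drop whitespace to the right of the token
        if i < len(pieces) - 1 and token.lstrip:
            piece = piece.rstrip()  # drop whitespace to the left of the next token
        if piece:
            out.append(piece)
    return out


plain = AddedTokenSketch("<mask>")
left_stripping = AddedTokenSketch("<mask>", lstrip=True)

print(split_on_token("Hello <mask> world", plain))
print(split_on_token("Hello <mask> world", left_stripping))
```

With `lstrip=True` the space before `<mask>` is absorbed by the token, so the preceding piece becomes `"Hello"` rather than `"Hello "`. That is the kind of per-token control the commit message refers to.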
setup.py 5.99 KB