    Cleanup fast tokenizers integration (#3706) · 827d6d6e
    Thomas Wolf authored
    
    
    * First pass on utility classes and python tokenizers
    
    * finishing cleanup pass
    
    * style and quality
    
    * Fix tests
    
    * Updating following @mfuntowicz comment
    
    * style and quality
    
    * Fix Roberta
    
    * fix batch_size/seq_length in BatchEncoding
    
    * add alignment methods + tests (a usage sketch follows this commit message)
    
    * Fix OpenAI and Transfo-XL tokenizers
    
    * adding trim_offsets=True default for GPT2 and RoBERTa (see the RoBERTa sketch after this commit message)
    
    * style and quality
    
    * fix tests
    
    * add_prefix_space in roberta
    
    * bump up tokenizers to rc7
    
    * style
    
    * unfortunately tensorflow does not like these - removing shape/seq_len for now
    
    * Update src/transformers/tokenization_utils.py
    Co-Authored-By: Stefan Schweter <stefan@schweter.it>
    
    * Adding doc and docstrings
    
    * making flake8 happy
    Co-authored-by: Stefan Schweter <stefan@schweter.it>
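
    A minimal sketch of the alignment helpers referred to above, assuming the transformers fast-tokenizer API (AutoTokenizer with use_fast=True, BatchEncoding.char_to_token / token_to_chars); exact method names and behavior may differ between releases.

    .. code-block:: python

        from transformers import AutoTokenizer

        # Request the fast (Rust-backed) tokenizer implementation.
        tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased", use_fast=True)
        encoding = tokenizer("Fast tokenizers keep character offsets.")

        # Map character index 5 (the "t" of "tokenizers") to its token index ...
        token_index = encoding.char_to_token(5)
        # ... and map that token back to its character span in the original text.
        char_span = encoding.token_to_chars(token_index)
        print(token_index, char_span)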
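
    A hedged sketch of the RoBERTa/GPT-2 byte-level options mentioned above (add_prefix_space and the trim_offsets default), assuming the RobertaTokenizerFast interface; exact parameter names and defaults may vary across transformers/tokenizers releases.

    .. code-block:: python

        from transformers import RobertaTokenizerFast

        # add_prefix_space=True lets the byte-level BPE treat the first word like
        # any other word; trim_offsets (default True) strips the leading space
        # from the reported character spans.
        tokenizer = RobertaTokenizerFast.from_pretrained(
            "roberta-base", add_prefix_space=True, trim_offsets=True
        )
        encoding = tokenizer("Hello world", return_offsets_mapping=True)

        print(encoding.tokens())           # byte-level BPE tokens, incl. <s> and </s>
        print(encoding["offset_mapping"])  # (start, end) character spans per token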
distilbert.rst