    [HUGE] Refactoring tokenizers backend - padding - truncation - pre-tokenized... · 36434220
    Anthony MOI authored
    
    [HUGE] Refactoring tokenizers backend - padding - truncation - pre-tokenized pipeline - fast tokenizers - tests (#4510)
    
    * Use tokenizers pre-tokenized pipeline
    
    * failing pretokenized test
    
    * Fix is_pretokenized in python
    
    * add pretokenized tests
    
    * style and quality
    
    * better tests for batched pretokenized inputs
    
    * tokenizers clean up - new padding_strategy - split the files
    
    * [HUGE] refactoring tokenizers - padding - truncation - tests
    
    * style and quality
    
    * bump up required tokenizers version to 0.8.0-rc1
    
    * switched padding/truncation API - simpler better backward compat
    
    * updating tests for custom tokenizers
    
    * style and quality - tests on pad
    
    * fix QA pipeline
    
    * fix backward compatibility for max_length only
    
    * style and quality
    
    * Various cleanups - add verbose
    
    * fix tests
    
    * update docstrings
    
    * Fix tests
    
    * Docs reformatted
    
    * __call__ method documented
    Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
    Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
test_tokenization_ctrl.py 2.53 KB