1. 08 Oct, 2020 1 commit
    • Adding Fast tokenizers for SentencePiece based tokenizers - Breaking: remove Transfo-XL fast tokenizer (#7141) · 9aeacb58
      Thomas Wolf authored (a usage sketch follows the change list below)
      
      * [WIP] SP tokenizers
      
      * fixing tests for T5
      
      * WIP tokenizers
      
      * serialization
      
      * update T5
      
      * WIP T5 tokenization
      
      * slow to fast conversion script
      
      * Refactoring to move tokenizer implementations inside transformers
      
      * Adding gpt - refactoring - quality
      
      * WIP adding several tokenizers to the fast world
      
      * WIP Roberta - moving implementations
      
      * update to dev4; switch file loading to in-memory loading
      
      * Updating and fixing
      
      * advancing on the tokenizers - updating do_lower_case
      
      * style and quality
      
      * moving forward with tokenizers conversion and tests
      
      * MBart, T5
      
      * dropping the fast version of Transfo-XL
      
      * Adding to autotokenizers + style/quality
      
      * update init and space_between_special_tokens
      
      * style and quality
      
      * bump up tokenizers version
      
      * add protobuf
      
      * fix pickling Bert JP with MeCab
      
      * fix newly added tokenizers
      
      * style and quality
      
      * fix bert japanese
      
      * fix funnel
      
      * limit tokenizer warning to one occurrence
      
      * clean up file
      
      * fix new tokenizers
      
      * fast tokenizers deep tests
      
      * WIP adding all the special fast tests on the new fast tokenizers
      
      * quick fix
      
      * adding more fast tokenizers in the fast tests
      
      * all tokenizers in fast version tested
      
      * Adding BertGenerationFast
      
      * bump up setup.py for CI
      
      * remove BertGenerationFast (too early)
      
      * bump up tokenizers version
      
      * Clean old docstrings
      
      * Typo
      
      * Update following Lysandre comments
      Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>
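      For reference, a minimal usage sketch of what this commit enables (an
      illustration, not part of the commit; assumes transformers >= 3.4 with
      the tokenizers, sentencepiece, and protobuf packages installed):

          # SentencePiece-based models such as T5 now get a Rust-backed
          # "fast" tokenizer through AutoTokenizer.
          from transformers import AutoTokenizer

          tok = AutoTokenizer.from_pretrained("t5-small", use_fast=True)
          print(tok.is_fast)  # True once the fast SentencePiece tokenizers landed

          # Offset mappings and .tokens() are only available on fast tokenizers.
          enc = tok("Hello world", return_offsets_mapping=True)
          print(enc.tokens())           # SentencePiece sub-word pieces
          print(enc["offset_mapping"])  # character span for each piece

      The slow-to-fast conversion script mentioned in the change list landed
      as transformers/convert_slow_tokenizer.py, whose convert_slow_tokenizer()
      turns a slow (Python) tokenizer into its tokenizers.Tokenizer backend.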
  2. 11 Aug, 2020 1 commit
  3. 01 Jul, 2020 2 commits
  4. 16 Jun, 2020 1 commit
  5. 15 Jun, 2020 1 commit
    • [HUGE] Refactoring tokenizers backend - padding - truncation - pre-tokenized pipeline - fast tokenizers - tests (#4510) · 36434220
      Anthony MOI authored (a usage sketch follows the change list below)
      
      * Use tokenizers pre-tokenized pipeline
      
      * failing pretokenized test
      
      * Fix is_pretokenized in python
      
      * add pretokenized tests
      
      * style and quality
      
      * better tests for batched pretokenized inputs
      
      * tokenizers clean up - new padding_strategy - split the files
      
      * [HUGE] refactoring tokenizers - padding - truncation - tests
      
      * style and quality
      
      * bump up required tokenizers version to 0.8.0-rc1
      
      * switched padding/truncation API - simpler better backward compat
      
      * updating tests for custom tokenizers
      
      * style and quality - tests on pad
      
      * fix QA pipeline
      
      * fix backward compatibility for max_length only
      
      * style and quality
      
      * Various cleans up - add verbose
      
      * fix tests
      
      * update docstrings
      
      * Fix tests
      
      * Docs reformatted
      
      * __call__ method documented
      Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
      Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
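      For reference, a minimal sketch of the padding/truncation API this
      refactor introduced (an illustration, not part of the commit; assumes
      transformers >= 3.0; note the pre-tokenized flag was named
      is_pretokenized in this release and is_split_into_words in later ones):

          from transformers import AutoTokenizer

          tok = AutoTokenizer.from_pretrained("bert-base-uncased")

          # Padding and truncation are explicit strategies on the documented
          # __call__ entry point, and pre-tokenized (already split) batches
          # go through the same pipeline.
          batch = tok(
              [["Hello", "world"], ["Fast", "tokenizers", "are", "fast"]],
              is_split_into_words=True,  # is_pretokenized at the time of this commit
              padding="max_length",      # True | "longest" | "max_length"
              truncation=True,           # True | "longest_first" | "only_first" | ...
              max_length=8,
          )
          print([len(ids) for ids in batch["input_ids"]])  # [8, 8]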
  6. 05 Jun, 2020 1 commit
  7. 19 May, 2020 1 commit