    Fix slow GemmaTokenizer and improve SPM slow -> fast conversion process (#32191) · 6e2d04e4
    Joshua Lochner authored
    * Remove user-defined tokens which can be obtained through merges
    
    * Remove debug line
    
    * formatting
    
    * Refactor spm slow -> fast converter
    
    * revert unnecessary refactor
    
    * set comprehension
    
    * remove test files
    
    * Use `vocab_scores`
    
    * Always replace spiece underline with space in decode
    
    * we no longer need token filtering
    
    * Add save fast load slow unit test
    
    * Remove tokenizers version check
    
    * Remove duplicate code
    
    * Make `<start_of_turn>` and `<end_of_turn>` special tokens
    
    * Bias merge priority with length if score is the same
    
    * Add unit test for merge priority
    
    * CI
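
The "Bias merge priority with length if score is the same" item can be illustrated with a minimal sketch. This is not the converter's actual code: `rank_merges` and the toy `scores` table are hypothetical, and the sketch assumes that higher-scoring pieces should merge first and that ties are broken in favour of the longer left part, then the longer right part.

```python
from typing import Dict, List, Tuple


def rank_merges(vocab_scores: Dict[str, float]) -> List[Tuple[str, str]]:
    """Hypothetical helper: derive an ordered merge list from piece scores."""
    candidates = []
    for piece, score in vocab_scores.items():
        # Try every split point; keep splits whose halves are both known pieces.
        for i in range(1, len(piece)):
            left, right = piece[:i], piece[i:]
            if left in vocab_scores and right in vocab_scores:
                candidates.append((left, right, score))

    # Assumed ordering: higher score first; when scores are equal,
    # prefer the longer left part, then the longer right part.
    candidates.sort(key=lambda c: (c[2], len(c[0]), len(c[1])), reverse=True)
    return [(left, right) for left, right, _ in candidates]


# Toy vocabulary: "abc" can be built as "ab"+"c" or "a"+"bc" with the same score.
scores = {"a": 0.0, "b": 0.0, "c": 0.0, "ab": -1.0, "bc": -1.5, "abc": -2.0}
merges = rank_merges(scores)
print(merges)
```

In this toy case both ways of building `abc` carry the same score, so the length bias alone decides that `("ab", "c")` is ranked ahead of `("a", "bc")`.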
test_tokenization_gemma.py 25.4 KB