    [`Tokenizer`] Fix slow and fast serialization (#26570) · ef7e9369
    Arthur authored
    * fix
    
    * last attempt
    
    * current work
    
    * fix forward compatibility
    
    * save all special tokens
    
    * current state
    
    * revert additional changes
    
    * updates
    
    * remove tokenizer.model
    
    * add a test and the fix
    
    * nit
    
    * revert one more break
    
    * fix typefield issue
    
    * quality
    
    * more tests
    
    * fix fields for FC
    
    * more nits?
    
    * new additional changes
    
    * how
    
    * some updates
    
    * simplify all
    
    * more nits
    
    * revert some things to original
    
    * nice
    
    * nits
    
    * a small hack
    
    * more nits
    
    * ahhaha
    
    * fixup
    
    * update
    
    * make test run on ci
    
    * use subtesting
    
    * update
    
    * Update .circleci/create_circleci_config.py
    
    * updates
    
    * fixup
    
    * nits
    
    * replace typo
    
    * fix the test
    
    * nits
    
    * update
    
    * None max dif pls
    
    * a partial fix
    
    * had to revert one thing
    
    * test the fast
    
    * updates
    
    * fixup
    
    * and more nits
    
    * more fixes
    
    * update
    
    * Oupsy 👁
    
    
    
    * nits
    
    * fix marian
    
    * on our way to heaven
    
    * Update src/transformers/models/t5/tokenization_t5.py
    Co-authored-by: Lysandre Debut <hi@lysand.re>
    
    * fixup
    
    * Update src/transformers/tokenization_utils_fast.py
    Co-authored-by: Leo Tronchon <leo.tronchon@gmail.com>
    
    * Update src/transformers/tokenization_utils_base.py
    Co-authored-by: Leo Tronchon <leo.tronchon@gmail.com>
    
    * fix phobert
    
    * skip some things, test more
    
    * nits
    
    * fixup
    
    * fix deberta
    
    * update
    
    * update
    
    * more updates
    
    * skip one test
    
    * more updates
    
    * fix camembert
    
    * can't test this one
    
    * more good fixes
    
    * kind of a major update
    
    - separate what is only done in fast in fast init and refactor
    - add_token(AddedToken(..., special = True)) ignores it in fast
    - better loading
    
    * fixup
    
    * more fixups
    
    * fix pegasus and mpnet
    
    * remove skipped tests
    
    * fix phoneme tokenizer if self.verbose
    
    * fix individual models
    
    * update common tests
    
    * update testing files
    
    * all over again
    
    * nits
    
    * skip test for markup lm
    
    * fixups
    
    * fix order of addition in fast by sorting the added tokens decoder
    
    * proper defaults for deberta
    
    * correct default for fnet
    
    * nits on add tokens, string initialized to special if special
    
    * skip irrelevant herbert tests
    
    * main fixes
    
    * update test added_tokens_serialization
    
    * the fix for bart like models and class instantiating
    
    * update bart
    
    * nit!
    
    * update idefics test
    
    * fix whisper!
    
    * some fixup
    
    * fixups
    
    * revert some of the wrong changes
    
    * fixup
    
    * fixup
    
    * skip marian
    
    * skip the correct tests
    
    * skip for tf and flax as well
    
    ---------
    Co-authored-by: Lysandre Debut <hi@lysand.re>
    Co-authored-by: Leo Tronchon <leo.tronchon@gmail.com>
test_tokenization_camembert.py 11.1 KB