    🚨🚨 🚨🚨 [`Tokenizer`] attempt to fix add_token issues 🚨🚨 🚨🚨 (#23909) · 2da88537
    Arthur authored
    
    
    * fix test for bart. Order is correct now, let's skip BPEs
    
    * phew
    
    * styling
    
    * fix bert....
    
    * slow refactoring
    
    * current updates
    
    * massive refactoring
    
    * update
    
    * NICE!
    
    * update to see where I am at
    
    * updates
    
    * update
    
    * update
    
    * revert
    
    * updates
    
    * updates
    
    * start supporting legacy_save
    
    * styling
    
    * big update
    
    * revert some changes
    
    * nits
    
    * nniiiiiice
    
    * small fixes
    
    * kinda fix t5 with new behaviour
    
    * major update
    
    * fixup
    
    * fix copies
    
    * today's updates
    
    * fix byt5
    
    * update
    
    * update
    
    * update
    
    * updates
    
    * update vocab size test
    
    * Barthez does not need the fairseq offset ids
    
    * super call must be after
    
    * call super
    
    * move all super init
    
    * move other super init
    
    * fixup
    
    * nits
    
    * more fixes
    
    * nits
    
    * more fixes
    
    * nits
    
    * more fix
    
    * remove useless files
    
    * ouch all of them are affected
    
    * and more!
    
    * small improvements
    
    * no more sanitize token
    
    * more changes around unique no split tokens
    
    * partially fix more things
    
    * keep legacy save but add warning
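    
A minimal sketch of the "legacy save plus warning" idea from the commit above (the function name and file layout here are illustrative, not the actual transformers API): the old save path keeps working, but it emits a `FutureWarning` steering users to the new format.

```python
import warnings

# Hypothetical sketch: keep the old single-file save path working,
# but warn that it is deprecated in favour of the new unified format.
def save_vocabulary(save_directory, legacy_format=True):
    if legacy_format:
        warnings.warn(
            "Legacy save format is deprecated and will be removed in a "
            "future version; pass legacy_format=False for the new format.",
            FutureWarning,
        )
        return f"{save_directory}/vocab.txt"    # old layout
    return f"{save_directory}/tokenizer.json"   # new unified layout
```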
    
    * so... more fixes
    
    * updates
    
    * guess deberta tokenizer could be nuked
    
    * fixup
    
    * fixup did some bad things
    
    * nuke it if it breaks
    
    * remove prints and pretrain fast from slow with new format.
    
    * fixups
    
    * Apply suggestions from code review
    Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
    
    * phew
    
    * nit
    
    * by default specials should not be normalized?
    
    * update
    
    * remove breakpoint
    
    * updates
    
    * a lot of updates
    
    * fixup
    
    * fixes revert some changes to match fast
    
    * small nits
    
    * that makes it cleaner
    
    * fix camembert accordingly
    
    * update
    
    * some less breaking changes
    
    * update
    
    * fixup
    
    * fix byt5 and whisper mostly
    
    * some more fixes, canine's byte vocab
    
    * fix gpt2
    
    * fix most of the perceiver tests (4 left)
    
    * fix layoutlmv3
    
    * fixup
    
    * fix copies for gpt2 style
    
    * make sure to only warn once
    
    * fix perceiver and gpt2 tests
    
    * some more backward compatibility: also read special tokens map because some people use it
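    
The backward-compatibility point above can be sketched as follows (the file names match the ones transformers writes, but this loading logic is an illustration only): prefer the tokenizer config, and fall back to `special_tokens_map.json`, since many published repos still ship it.

```python
import json
import os

# Hypothetical sketch: load special tokens from the tokenizer config,
# falling back to the legacy special_tokens_map.json if needed.
def load_special_tokens(folder):
    candidates = (
        os.path.join(folder, "tokenizer_config.json"),
        os.path.join(folder, "special_tokens_map.json"),  # legacy fallback
    )
    for path in candidates:
        if os.path.isfile(path):
            with open(path, encoding="utf-8") as f:
                data = json.load(f)
            tokens = {k: v for k, v in data.items() if k.endswith("_token")}
            if tokens:
                return tokens
    return {}
```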
    
    * fixup
    
    * add else when reading
    
    * nits
    
    * fresh updates
    
    * fix copies
    
    * will this make everything faster?
    
    * fixes
    
    * more fixes
    
    * update
    
    * more fixes
    
    * fixup
    
    * is the source of truth right?
    
    * sorry camembert for the troubles
    
    * current updates
    
    * fixup
    
    * update led
    
    * update
    
    * fix regression
    
    * fix single word
    
    * more model specific fixes
    
    * fix t5 tests
    
    * fixup
    
    * more comments
    
    * update
    
    * fix nllb
    
    * rstrip removed
    
    * small fixes
    
    * better handle additional_special_tokens and vocab sizes
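    
The vocab-size bookkeeping mentioned above can be illustrated with plain dicts (this is a toy class, not the real tokenizer hierarchy): the base `vocab_size` stays fixed while added tokens get fresh indices, so the total length is base size plus the number of added tokens.

```python
# Toy illustration of base vocab vs. added tokens (not transformers code).
class TinyTokenizer:
    def __init__(self, vocab):
        self.vocab = dict(vocab)          # base vocabulary, frozen
        self.added_tokens_encoder = {}    # token -> id for added tokens

    @property
    def vocab_size(self):
        return len(self.vocab)            # base size only, by convention

    def add_tokens(self, tokens):
        added = 0
        for tok in tokens:
            if tok not in self.vocab and tok not in self.added_tokens_encoder:
                # new tokens get ids after the base vocabulary
                self.added_tokens_encoder[tok] = (
                    len(self.vocab) + len(self.added_tokens_encoder)
                )
                added += 1
        return added

    def __len__(self):
        return len(self.vocab) + len(self.added_tokens_encoder)
```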
    
    * fixing
    
    * styling
    
    * fix 4 / 21
    
    * fixup
    
    * fix nllb's tests
    
    * some fixes
    
    * fix t5
    
    * fixes
    
    * style
    
    * fix canine tests
    
    * damn this is nice
    
    * nits
    
    * m2m100 nit
    
    * fixups
    
    * fixes!
    
    * fixup
    
    * stash
    
    * fix merge
    
    * revert bad change
    
    * fixup
    
    * correct order for code Llama
    
    * fix speecht5 post merge
    
    * styling
    
    * revert source of 11 fails
    
    * small nits
    
    * all changes in one go
    
    * fnet hack
    
    * fix 2 more tests
    
    * update based on main branch of tokenizers
    
    * fixup
    
    * fix VITS issues
    
    * more fixes
    
    * fix mgp test
    
    * fix camembert issues
    
    * oups camembert still has 2 failing tests
    
    * mluke fixes
    
    * decode fixes
    
    * small nits
    
    * nits
    
    * fix llama and vits
    
    * fix camembert
    
    * small nits
    
    * more fixes when initialising a fast tokenizer from a slow one, etc.
    
    * fix one of the last test
    
    * fix CPM tokenizer test
    
    * fixups
    
    * fix pop2piano
    
    * fixup
    
    * Change tokenizers required version
    
    * Change tokenizers required version
    
    * "tokenizers>=0.14,<0.15", don't forget smaller than
    
    * fix musicgen tests and PreTrainedTokenizerFast
    
    * fix owlvit and all
    
    * update t5
    
    * fix 800 red
    
    * fix tests
    
    * fix the fix of the fix of t5
    
    * styling
    
    * documentation nits
    
    * cache _added_tokens_encoder
    
    * fixups
    
    * Nit
    
    * fix red tests
    
    * one last nit!
    
    * make everything a lot simpler
    
    * Now it's over 😉
    
    
    
    * few small nits
    
    * Apply suggestions from code review
    Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
    
    * updates that work for now
    
    * tests that should not be skipped / changed and fixed next
    
    * fixup
    
    * i am ashamed
    
    * push the fix
    
    * update
    
    * fixups
    
    * nits
    
    * fix added_tokens_encoder
    
    * fix canine test
    
    * fix pegasus vocab
    
    * fix transfoXL
    
    * fixup
    
    * whisper needs to be fixed for train new
    
    * pegasus nits
    
    * more pegasus fixes
    
    * minor update
    
    * better error message in failed test
    
    * fix whisper failing test
    
    * fix whisper failing test
    
    * fix pegasus
    
    * fixup
    
    * fix **** pegasus
    
    * reset things
    
    * remove another file
    
    * attempts to fix the strange custom encoder and offset
    
    * nits here and there
    
    * update
    
    * fixup
    
    * nit
    
    * fix the whisper test
    
    * nits nits
    
    * Apply suggestions from code review
    Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
    
    * updates based on review
    
    * some small update to potentially remove
    
    * nits
    
    * import lru cache
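    
The caching mentioned in the commits above ("cache _added_tokens_encoder", "import lru cache") can be sketched with `functools.lru_cache` (a hypothetical free function for illustration; the real code caches a derived mapping on the tokenizer instance): the derived encoder is memoized so repeated lookups don't rebuild it.

```python
from functools import lru_cache

# Hypothetical sketch: memoize an expensive derived mapping.
# `items` must be hashable, hence a tuple of (token, id) pairs.
@lru_cache(maxsize=1)
def added_tokens_encoder(items):
    return dict(items)
```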
    
    * Update src/transformers/tokenization_utils_base.py
    Co-authored-by: Lysandre Debut <hi@lysand.re>
    
    * move warning to `from_pretrained`
    
    * update tests results now that the special tokens are always added
    
    ---------
    Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
    Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
    Co-authored-by: Lysandre Debut <hi@lysand.re>