- 24 May, 2024 1 commit
-
-
Ita Zaporozhets authored
* seems like `split_special_tokens` is used here * split special token * add new line at end of file * moving split special token test to common tests * added assertions * test * fixup * add co-author * passing rest of args to gptsan_japanese, fixing tests * removing direct comparison of fast and slow models * adding test support for UDOP and LayoutXLM * ruff fix * readd check if slow tokenizer * modify test to handle bos tokens * removing commented function * trigger build * applying review feedback - updated docstrings, var names, and simplified tests * ruff fixes * Update tests/test_tokenization_common.py Co-authored-by:
Arthur <48595927+ArthurZucker@users.noreply.github.com> * applying feedback, comments * shutil temp directory fix --------- Co-authored-by:
Arthur Zucker <arthur.zucker@gmail.com> Co-authored-by:
Ita Zaporozhets <itazaporozhets@Itas-MBP.localdomain> Co-authored-by:
itazap <itazap@users.noreply.github.com> Co-authored-by:
Arthur <48595927+ArthurZucker@users.noreply.github.com> Co-authored-by:
Ita Zaporozhets <itazaporozhets@Itas-MacBook-Pro.local>
-
- 22 May, 2024 1 commit
-
-
Arthur authored
* update ruff version * fix research projects * Empty * Fix errors --------- Co-authored-by:Lysandre <lysandre@huggingface.co>
-
- 20 Mar, 2024 1 commit
-
-
Matt authored
* Add correct batched handling for apply_chat_template * Fix warning method * Add error for incompatible options * expand tests * Add a skip for markuplm * Add skips for other layout models * Skip for LayoutLMv2 * Slightly update the warning message * Update src/transformers/tokenization_utils_base.py Co-authored-by:
Arthur <48595927+ArthurZucker@users.noreply.github.com> * Update src/transformers/tokenization_utils_base.py Co-authored-by:
Arthur <48595927+ArthurZucker@users.noreply.github.com> * Update src/transformers/tokenization_utils_base.py Co-authored-by:
Arthur <48595927+ArthurZucker@users.noreply.github.com> * Update src/transformers/tokenization_utils_base.py Co-authored-by:
Arthur <48595927+ArthurZucker@users.noreply.github.com> * Update src/transformers/tokenization_utils_base.py Co-authored-by:
Arthur <48595927+ArthurZucker@users.noreply.github.com> * Update src/transformers/tokenization_utils_base.py Co-authored-by:
Arthur <48595927+ArthurZucker@users.noreply.github.com> * typo fix * Update docstring for conversation kwarg * Update return docstring * Remove the warning, improve error message * Update src/transformers/tokenization_utils_base.py Co-authored-by:
amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Update src/transformers/tokenization_utils_base.py Co-authored-by:
amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Update tests/test_tokenization_common.py Co-authored-by:
amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Update tests/test_tokenization_common.py Co-authored-by:
amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Remove return_dict=None * Fix up some merge cruft * More merge cruft * Add another skip * Add another skip --------- Co-authored-by:
Arthur <48595927+ArthurZucker@users.noreply.github.com> Co-authored-by:
amyeroberts <22614925+amyeroberts@users.noreply.github.com>
-
- 13 Mar, 2024 1 commit
-
-
Lysandre Debut authored
* Adds pretrained IDs directly in the tests * Fix tests * Fix tests * Review!
-
- 16 Nov, 2023 1 commit
-
-
Arthur authored
* try to stylify using ruff * might need to remove these changes? * use ruf format andruff check * use isinstance instead of type comparision * use # fmt: skip * use # fmt: skip * nits * soem styling changes * update ci job * nits isinstance * more files update * nits * more nits * small nits * check and format * revert wrong changes * actually use formatter instead of checker * nits * well docbuilder is overwriting this commit * revert notebook changes * try to nuke docbuilder * style * fix feature exrtaction test * remve `indent-width = 4` * fixup * more nits * update the ruff version that we use * style * nuke docbuilder styling * leve the print for detected changes * nits * Remove file I/O Co-authored-by:
charliermarsh <charlie.r.marsh@gmail.com> * style * nits * revert notebook changes * Add # fmt skip when possible * Add # fmt skip when possible * Fix * More ` # fmt: skip` usage * More ` # fmt: skip` usage * More ` # fmt: skip` usage * NIts * more fixes * fix tapas * Another way to skip * Recommended way * Fix two more fiels * Remove asynch Remove asynch --------- Co-authored-by:
charliermarsh <charlie.r.marsh@gmail.com>
-
- 14 Sep, 2023 1 commit
-
-
Matt authored
* First commit while I figure this out * make fixup * Remove unused method * Store prompt attrib * Fix prompt argument for tests * Make same changes in fast tokenizer * Remove global prompts from fast tokenizer too * stash commit * stash commit * Migrate PromptConfig to its True Final Location * Replace Conversation entirely with the new class * Import/dependency fixes * Import/dependency fixes * Change format for lots of default prompts * More default prompt fixups * Revert llama old methods so we can compare * Fix some default configs * Fix some default configs * Fix misspelled kwarg * Fixes for Blenderbot * make fixup * little rebase cleanup * Add basic documentation * Quick doc fix * Truncate docstring for now * Add handling for the case when messages is a single string * Quick llama merges * Update conversational pipeline and tests * Add a couple of legacy properties for backward compatibility * More legacy handling * Add docstring for build_conversation_input_ids * Restructure PromptConfig * Let's start T E M P L A T I N G * Refactor all default configs to use templates instead * Revert changes to the special token properties since we don't need them anymore * More class templates * Make the sandbox even sandier * Everything replaced with pure templating * Remove docs for PromptConfig * Add testing and optional requirement boilerplate * Fix imports and make fixup * Fix LLaMA tests and add Conversation docstring * Finally get LLaMA working with the template system * Finally get LLaMA working with the template system * make fixup * make fixup * fmt-off for the long lists of test tokens * Rename method to apply_chat_template for now * Start on documentation * Make chat_template a property that reads through to the default if it's not set * Expand docs * Expand chat templating doc some more * trim/lstrip blocks by default and update doc * Few doc tweaks * rebase cleanup * Clarify docstring * rebase cleanup * rebase cleanup * make fixup * Quick doc edit * Reformat the standard template to match ChatML * Re-add PEFT check * Update docs/source/en/chat_templating.md Co-authored-by:
Patrick von Platen <patrick.v.platen@gmail.com> * Add apply_chat_template to the tokenizer doc * make fixup * Add doc links * Fix chat links * Fix chat links * Explain system messages in the doc * Add chat template test * Proper save-loading for chat template attribute * Add test skips for layout models * Remove _build_conversation_input_ids, add default_chat_template to code_llama * Make sure all LLaMA models are using the latest template * Remove default_system_prompt block in code_llama because it has no default prompt * Update ConversationPipeline preprocess * Add correct #Copied from links to the default_chat_templates * Remove unneeded type checking line * Add a dummy mark_processsed method * Reorganize Conversation to have **deprecated_kwargs * Update chat_templating.md * Quick fix to LLAMA tests * Small doc tweaks * Add proper docstrings and "copied from" statements to all default chat templates * Merge use_default_system_prompt support for code_llama too * Improve clarity around self.chat_template * Docstring fix * Fix blenderbot default template * More doctest fix * Break out some tokenizer kwargs * Update doc to explain default templates * Quick tweaks to tokenizer args * Cleanups for tokenizer args * Add note about cacheing * Quick tweak to the chat-templating doc * Update the LLaMA template with error checking and correct system message embedding * make fixup * make fixup * add requires_jinja * Cleanup to expected output formatting * Add cacheing * Fix typo in llama default template * Update LLaMA tests * Update documentation * Improved legacy handling in the Conversation class * Update Jinja template with proper error handling * Quick bugfix * Proper exception raising * Change cacheing behaviour so it doesn't try to pickle an entire Jinja env * make fixup * rebase cleanup --------- Co-authored-by:
Patrick von Platen <patrick.v.platen@gmail.com>
-
- 18 Aug, 2023 1 commit
-
-
Arthur authored
* draft changes * update and add tests * styling for no * move test * path to usable model * update test * small update * update bertbased tokenizers * don'tuse kwargs for _tokenize * don'tuse kwargs for _tokenize * fix copies * update * update test for special tokenizers * fixup * skip two tests * remove pdb breakpiont() * wowo * rewrite custom tests * nits * revert chang in target keys * fix markup lm * update documentation of the argument
-
- 06 Feb, 2023 1 commit
-
-
Sylvain Gugger authored
* Result of black 23.1 * Update target to Python 3.7 * Switch flake8 to ruff * Configure isort * Configure isort * Apply isort with line limit * Put the right black version * adapt black in check copies * Fix copies
-
- 14 Nov, 2022 1 commit
-
-
Bartosz Szmelczynski authored
* First draft * Remove scatter dependency * Add require_torch * update vectorized sum test, add clone call * remove artifacts * fix style * fix style v2 * remove "scatter" mentions from the code base * fix isort error Co-authored-by:
Niels Rogge <nielsrogge@Nielss-MacBook-Pro.local> Co-authored-by:
ydshieh <ydshieh@users.noreply.github.com>
-
- 02 Nov, 2022 1 commit
-
-
Ben Eyal authored
馃毃 馃毃 馃毃 Fix Issue 15003: SentencePiece Tokenizers Not Adding Special Tokens in `convert_tokens_to_string` (#15775) * Add test for SentencePiece not adding special tokens to strings * Add SentencePieceStringConversionMixin to fix issue 15003 * Fix conversion from tokens to string for most SentencePiece tokenizers Tokenizers fixed: - AlbertTokenizer - BarthezTokenizer - CamembertTokenizer - FNetTokenizer - M2M100Tokenizer - MBart50Tokenizer - PegasusTokenizer - Speech2TextTokenizer * Fix MarianTokenizer, adjust SentencePiece test to accomodate vocab * Fix DebertaV2Tokenizer * Ignore LayoutXLMTokenizer in SentencePiece string conversion test * Run 'make style' and 'make quality' * Clean convert_tokens_to_string test Instead of explicitly ignoring LayoutXLMTokenizer in the test, override the test in LayoutLMTokenizationTest and do nothing in it. * Remove commented out code * Improve robustness of convert_tokens_to_string test Instead of comparing lengths of re-tokenized text and input_ids, check that converting all special tokens to string yields a string with all special tokens. * Inline and remove SentencePieceStringConversionMixin The convert_tokens_to_string method is now implemented in each relevant SentencePiece tokenizer. * Run 'make style' and 'make quality' * Revert removal of space in convert_tokens_to_string * Remove redundant import * Revert test text to original * Uncomment the lowercasing of the reverse_text variable * Mimic Rust tokenizer behavior for tokenizers - Albert - Barthez - Camembert - MBart50 - T5 * Fix accidentally skipping test in wrong tokenizer * Add test for equivalent Rust and slow tokenizer behavior * Override _decode in BigBirdTokenizer to mimic Rust behavior * Override _decode in FNetTokenizer to mimic Rust behavior * Override _decode in XLNetTokenizer to mimic Rust behavior * Remove unused 're' import * Update DebertaV2Tokenizer to mimic Rust tokenizer * Deberta tokenizer now behaves like Albert and its `convert_tokens_to_string` is not tested. * Ignore problematic tests in Deberta V2 * Add comment on why the Deberta V2 tests are skipped
-
- 24 Aug, 2022 1 commit
-
-
SaulLu authored
add warning to let the user know that the `__call__` method is faster than `encode` + `pad` for a fast tokenizer (#18693) * add warning to let the user know that the method is slower that for a fast tokenizer * user warnings * fix layoutlmv2 * fix layout* * change warnings into logger.warning
-
- 12 May, 2022 1 commit
-
-
Sylvain Gugger authored
* Black preview * Fixup too! * Fix check copies * Use the same version as the CI * Bump black
-
- 03 May, 2022 1 commit
-
-
Yih-Dar authored
* move test model folders (TODO: fix imports and others) * fix (potentially partially) imports (in model test modules) * fix (potentially partially) imports (in tokenization test modules) * fix (potentially partially) imports (in feature extraction test modules) * fix import utils.test_modeling_tf_core * fix path ../fixtures/ * fix imports about generation.test_generation_flax_utils * fix more imports * fix fixture path * fix get_test_dir * update module_to_test_file * fix get_tests_dir from wrong transformers.utils * update config.yml (CircleCI) * fix style * remove missing imports * update new model script * update check_repo * update SPECIAL_MODULE_TO_TEST_MAP * fix style * add __init__ * update self-scheduled * fix add_new_model scripts * check one way to get location back * python setup.py build install * fix import in test auto * update self-scheduled.yml * update slack notification script * Add comments about artifact names * fix for yolos Co-authored-by:ydshieh <ydshieh@users.noreply.github.com>
-
- 23 Feb, 2022 1 commit
-
-
Lysandre Debut authored
* Per-folder tests reorganization Co-authored-by:
sgugger <sylvain.gugger@gmail.com> Co-authored-by:
Stas Bekman <stas@stason.org>
-
- 27 Jan, 2022 1 commit
-
-
SaulLu authored
* add new test * add a feature to same the sentencepiece tokenizer model when the init file was deleted * update marian * update m2m_100 * fix marian * update speech to text * override test for layoutxlm * fix saving bartpho * remove harcoded values bartpho * special token string version * finish bartpho * override layoutxml test * add mbart * move special tokens list * format * Revert "format" This reverts commit 37a40df37903a932c2f951cbd33acb684246bae7. * simplify list of string of special tokens * Re-write `self.fairseq_tokens_to_ids ` initialization logic with special tokens Co-authored-by:
Sylvain Gugger <sylvain.gugger@gmail.com> Co-authored-by:
Sylvain Gugger <sylvain.gugger@gmail.com>
-
- 03 Jan, 2022 1 commit
-
-
Nicolas Patry authored
* Enabling `truncation_side` for Slow and Fast tokenizer. Co-Authored-by:
Niels Rogge <48327001+NielsRogge@users.noreply.github.com> * Disable failing tests. * Layout xlm. * assert -> assertEqual. Co-authored-by:
Niels Rogge <48327001+NielsRogge@users.noreply.github.com>
-
- 28 Dec, 2021 1 commit
-
-
Patrick von Platen authored
* speed up canine and mluke * speed up mbart and mbart50 toks * upload files
-
- 03 Nov, 2021 1 commit
-
-
NielsRogge authored
* Add LayoutXLMTokenizer and LayoutXLMTokenizerFast * Fix styling issues * Fix more styling issues * Fix more styling issues * Fix docstring * Fix unit tests * Fix docs * Fix unit tests * Fix typos and styling issues * Fix styling issues * Fix docstring * Make all tests of test_tokenization_layoutxlm pass * Add LayoutXLMProcessor * Make fixup * Make all LayoutXLMProcessor tests pass * Minor fixes * Leave LayoutLMv2Processor tests unchanged * Fix code quality * Move LayoutXLM tokenizers and processor to separate folder * Fix code quality * Apply suggestions from code review * Replace assertions by value errors * Remove methods from fast tokenizer Co-authored-by:King Yiu Suen <kingyiusuen@gmail.com>
-
- 30 Aug, 2021 1 commit
-
-
NielsRogge authored
* First commit * Make style * Fix dummy objects * Add Detectron2 config * Add LayoutLMv2 pooler * More improvements, add documentation * More improvements * Add model tests * Add clarification regarding image input * Improve integration test * Fix bug * Fix another bug * Fix another bug * Fix another bug * More improvements * Make more tests pass * Make more tests pass * Improve integration test * Remove gradient checkpointing and add head masking * Add integration test * Add LayoutLMv2ForSequenceClassification to the tests * Add LayoutLMv2ForQuestionAnswering * More improvements * More improvements * Small improvements * Fix _LazyModule * Fix fast tokenizer * Move sync_batch_norm to a separate method * Replace dummies by requires_backends * Move calculation of visual bounding boxes to separate method + update README * Add models to main init * First draft * More improvements * More improvements * More improvements * More improvements * More improvements * Remove is_split_into_words * More improvements * Simply tesseract - no use of pandas anymore * Add LayoutLMv2Processor * Update is_pytesseract_available * Fix bugs * Improve feature extractor * Fix bug * Add print statement * Add truncation of bounding boxes * Add tests for LayoutLMv2FeatureExtractor and LayoutLMv2Tokenizer * Improve tokenizer tests * Make more tokenizer tests pass * Make more tests pass, add integration tests * Finish integration tests * More improvements * More improvements - update API of the tokenizer * More improvements * Remove support for VQA training * Remove some files * Improve feature extractor * Improve documentation and one more tokenizer test * Make quality and small docs improvements * Add batched tests for LayoutLMv2Processor, remove fast tokenizer * Add truncation of labels * Apply suggestions from code review * Improve processor tests * Fix failing tests and add suggestion from code review * Fix tokenizer test * Add detectron2 CI job * Simplify CI job * Comment out non-detectron2 jobs and specify number of processes * Add pip install torchvision * Add durations to see which tests are slow * Fix tokenizer test and make model tests smaller * Frist draft * Use setattr * Possible fix * Proposal with configuration * First draft of fast tokenizer * More improvements * Enable fast tokenizer tests * Make more tests pass * Make more tests pass * More improvements * Addd padding to fast tokenizer * Mkae more tests pass * Make more tests pass * Make all tests pass for fast tokenizer * Make fast tokenizer support overflowing boxes and labels * Add support for overflowing_labels to slow tokenizer * Add support for fast tokenizer to the processor * Update processor tests for both slow and fast tokenizers * Add head models to model mappings * Make style & quality * Remove Detectron2 config file * Add configurable option to label all subwords * Fix test * Skip visual segment embeddings in test * Use ResNet-18 backbone in tests instead of ResNet-101 * Proposal * Re-enable all jobs on CI * Fix installation of tesseract * Fix failing test * Fix index table * Add LayoutXLM doc page, first draft of code examples * Improve documentation a lot * Update expected boxes for Tesseract 4.0.0 beta * Use offsets to create labels instead of checking if they start with ## * Update expected boxes for Tesseract 4.1.1 * Fix conflict * Make variable names cleaner, add docstring, add link to notebooks * Revert "Fix conflict" This reverts commit a9b46ce9afe47ebfcfe7b45e6a121d49e74ef2c5. * Revert to make integration test pass * Apply suggestions from @LysandreJik's review * Address @patrickvonplaten's comments * Remove fixtures DocVQA in favor of dataset on the hub Co-authored-by:Lysandre <lysandre.debut@reseau.eseo.fr>
-