"examples/vscode:/vscode.git/clone" did not exist on "5011efbec81a7a1d094a2eda8bde2b74613ca8b8"
- 14 Oct, 2022 1 commit
-
-
Sylvain Gugger authored
-
- 27 Sep, 2022 1 commit
-
-
Sylvain Gugger authored
* More tests for regression in cached non existence * Style
-
- 16 Sep, 2022 2 commits
-
-
Sylvain Gugger authored
-
Sylvain Gugger authored
* Fix tokenizer load from one file * Add a test * Style Co-authored-by:Lysandre <lysandre.debut@reseau.eseo.fr>
-
- 15 Sep, 2022 1 commit
-
-
Sylvain Gugger authored
* Fix CI for custom tokenizers * Add nightly tests * Run CI, run! * Fix paths * Typos * Fix test
-
- 29 Aug, 2022 1 commit
-
-
Lucain authored
-
- 24 Aug, 2022 1 commit
-
-
SaulLu authored
add warning to let the user know that the `__call__` method is faster than `encode` + `pad` for a fast tokenizer (#18693) * add warning to let the user know that the method is slower that for a fast tokenizer * user warnings * fix layoutlmv2 * fix layout* * change warnings into logger.warning
-
- 05 Aug, 2022 1 commit
-
-
Sylvain Gugger authored
* Draft new cached_file * Initial draft for config and model * Small fixes * Fix first batch of tests * Look in cache when internet is down * Fix last tests * Bad black, not fixing all quality errors * Make diff less * Implement change for TF and Flax models * Add tokenizer and feature extractor * For compatibility with main * Add utils to move the cache and auto-do it at first use. * Quality * Deal with empty commit shas * Deal with empty etag * Address review comments
-
- 01 Aug, 2022 1 commit
-
-
Sylvain Gugger authored
* Rewrite push_to_hub to use upload_files * Adapt the doc a bit * Address review comments and clean doc
-
- 11 Jul, 2022 1 commit
-
-
Yulv-git authored
* Fix some typos. Signed-off-by:
Yulv-git <yulvchi@qq.com> * Fix typo. Signed-off-by:
Yulv-git <yulvchi@qq.com> * make fixup.
-
- 23 Jun, 2022 1 commit
-
-
Guillaume Klein authored
Co-authored-by:SaulLu <55560583+SaulLu@users.noreply.github.com>
-
- 21 Jun, 2022 1 commit
-
-
Lysandre Debut authored
* Prepare CI for v0.8.0 * pin hfh (revert before merge) * Revert "pin hfh (revert before merge)" This reverts commit a0103140e1c77b810ffcb735192968bc03be3e1f. * Test rc3 * Test latest rc * Unpin to the RC Co-authored-by:Sylvain Gugger <Sylvain.gugger@gmail.com>
-
- 31 May, 2022 1 commit
-
-
Patrick von Platen authored
[Json configs] Make json prettier for all saved tokenizer files & ensure same json format for all processors (tok + feat_extract) (#17457) * [Json dump] Make json prettier * correct more tokenizeirs * more patterns * add aggressive test * the aggressive test was actually useful :-) * more tests * Apply suggestions from code review
-
- 12 May, 2022 1 commit
-
-
Sylvain Gugger authored
* Black preview * Fixup too! * Fix check copies * Use the same version as the CI * Bump black
-
- 13 Apr, 2022 1 commit
-
-
davidleonfdez authored
* Fix setters of *_token_id properties of SpecialTokensMixin * Test setters of common tokens ids * Move to a separate test checks of setters of tokens ids * Add independent test for ByT5 * Add Canine test * Test speech to text
-
- 04 Apr, 2022 1 commit
-
-
SaulLu authored
* add new tests * add comment to overridden tests
-
- 23 Mar, 2022 1 commit
-
-
Sylvain Gugger authored
* Make Transformers use cache files when hf.co is down * Fix tests * Was there a random circleCI failure? * Isolate patches * Style * Comment out the failure since it doesn't fail anymore * Better comment
-
- 15 Feb, 2022 1 commit
-
-
Sylvain Gugger authored
* Allow custom code for Processors * Add more test * Test all auto_map configs are properly set
-
- 02 Feb, 2022 2 commits
-
-
SaulLu authored
* change truncation_side in init of `PreTrainedTokenizerBase` Co-authored-by:
LSinev <LSinev@users.noreply.github.com> * add test * Revert "replace assert with exception for `padding_side` arg in `PreTrainedTokenizerBase` `__init__`" This reverts commit 7a98b87962d2635c7e4d4f00db3948b694624843. * fix kwargs * Revert "fix kwargs" This reverts commit 67b0a5270e8cf1dbf70e6b0232e94c0452b6946f. * Update tests/test_tokenization_common.py Co-authored-by:
Nicolas Patry <patry.nicolas@protonmail.com> * delete truncation_side variable * reorganize test * format * complete doc * Revert "Revert "replace assert with exception for `padding_side` arg in `PreTrainedTokenizerBase` `__init__`"" This reverts commit d5a10a7e2680539e5d9e98ae5d896c893d224b80. * fix typo * fix typos to render documentation * Revert "Revert "Revert "replace assert with exception for `padding_side` arg in `PreTrainedTokenizerBase` `__init__`""" This reverts commit 16cf58811943a08f43409a7c83eaa330686591d0. * format Co-authored-by:
LSinev <LSinev@users.noreply.github.com> Co-authored-by:
Nicolas Patry <patry.nicolas@protonmail.com>
-
Sylvain Gugger authored
* Allow dynamic modules to use relative imports * Work for configs * Fix last merge conflict * Save code of registered custom objects * Map strings to strings * Fix test * Add tokenizer * Rework tests * Tests * Ignore fixtures py files for tests * Tokenizer test + fix collection * With full path * Rework integration * Fix typo * Remove changes in conftest * Test for tokenizers * Add documentation * Update docs/source/custom_models.mdx Co-authored-by:
Lysandre Debut <lysandre@huggingface.co> * Add file structure and file content * Add more doc * Style * Update docs/source/custom_models.mdx Co-authored-by:
Suraj Patil <surajp815@gmail.com> * Address review comments Co-authored-by:
Lysandre Debut <lysandre@huggingface.co> Co-authored-by:
Suraj Patil <surajp815@gmail.com>
-
- 01 Feb, 2022 2 commits
-
-
SaulLu authored
fix the `tokenizer_config.json` file for the slow tokenizer when a fast version is available (#15319) * add new test * update test * remove `tokenizer_file` from `additional_files_names` in `tokenization_utils_base.py` * add `tokenizer_file` for the fast only tokenizer * change global variables layoutxml * remove `"tokenizer_file"` from DPR tokenizer's Global variables * remove `tokenizer_file` from herbert slow tokenizer init * `"tokenizer_file"` from LED tokenizer's Global variables * remove `tokenizer_file` from mbart slow tokenizer init * remove `tokenizer_file` from slow tokenizer template * adapt to versioning * adapt the `test_tokenizer_mismatch_warning` test * clean test * clarify `VOCAB_FILES_NAMES` in tokenization_utils_fast.py * Revert "remove `tokenizer_file` from mbart slow tokenizer init" This reverts commit 0dbb723fa9c7599d4640fe30b3647a74eb4a64e1. * Revert "`"tokenizer_file"` from LED tokenizer's Global variables" This reverts commit 5a3f879bdd651233f3d74a3d1146c34cde82b0c2. * Revert "remove `tokenizer_file` from herbert slow tokenizer init" This reverts commit f5e10007b7b0ec5345e015b9de7ffec72c5407fd. * Revert "remove `"tokenizer_file"` from DPR tokenizer's Global variables" This reverts commit da0895330bedfafc81ae3073470a9348c669f032. * set `tokenizer_file` in super `__init__` of mbart
-
SaulLu authored
* replace assert with exception for `padding_side` arg in `PreTrainedTokenizerBase` `__init__` * add test * fix kwargs * reformat test * format * format * fix typo to render the documentation
-
- 27 Jan, 2022 1 commit
-
-
SaulLu authored
* add new test * add a feature to same the sentencepiece tokenizer model when the init file was deleted * update marian * update m2m_100 * fix marian * update speech to text * override test for layoutxlm * fix saving bartpho * remove harcoded values bartpho * special token string version * finish bartpho * override layoutxml test * add mbart * move special tokens list * format * Revert "format" This reverts commit 37a40df37903a932c2f951cbd33acb684246bae7. * simplify list of string of special tokens * Re-write `self.fairseq_tokens_to_ids ` initialization logic with special tokens Co-authored-by:
Sylvain Gugger <sylvain.gugger@gmail.com> Co-authored-by:
Sylvain Gugger <sylvain.gugger@gmail.com>
-
- 06 Jan, 2022 1 commit
-
-
Nicolas Patry authored
-
- 03 Jan, 2022 1 commit
-
-
Nicolas Patry authored
* Enabling `truncation_side` for Slow and Fast tokenizer. Co-Authored-by:
Niels Rogge <48327001+NielsRogge@users.noreply.github.com> * Disable failing tests. * Layout xlm. * assert -> assertEqual. Co-authored-by:
Niels Rogge <48327001+NielsRogge@users.noreply.github.com>
-
- 30 Dec, 2021 1 commit
-
-
Nicolas Patry authored
* Fixing a pathological case for slow tokenizers * Update src/transformers/tokenization_utils.py
-
- 03 Dec, 2021 1 commit
-
-
Li-Huai (Allan) Lin authored
* Use new method to acquire tokenizers * Resolve TODOs. * Style * Fix * Enable do_lower_case in test_tokenize_special_tokens * Apply suggestion from code review * Fix mask token handling * Revert "Fix mask token handling" This reverts commit daaa3f5291b1f71e5bc3604ca281c000000c4648. * Fix FNet mask token tokenization * Complete everything * Apply suggestions from code review
-
- 10 Nov, 2021 1 commit
-
-
Li-Huai (Allan) Lin authored
* Fix index out of range when padding * Apply suggestions from code review * Style
-
- 08 Nov, 2021 1 commit
-
-
Sylvain Gugger authored
* Dynamic configs * Add config test * Better tests * Add tokenizer and test * Add to from_config * With save
-
- 02 Nov, 2021 1 commit
-
-
Sylvain Gugger authored
* Update Transformers to huggingface_hub >= 0.1.0 * Forgot to save... * Style * Fix test
-
- 11 Oct, 2021 1 commit
-
-
Sylvain Gugger authored
* Honor existing attention mask in tokenzier.pad * Fix initialization of attention mask * Roll the implem on all subclasses * Fix tests
-
- 08 Oct, 2021 1 commit
-
-
Nicolas Patry authored
* Adding support for tokens being suffixes or part of each other. * Better test name.
-
- 05 Oct, 2021 1 commit
-
-
Nicolas Patry authored
-
- 17 Sep, 2021 1 commit
-
-
Li-Huai (Allan) Lin authored
* Fix special tokens not correctly tokenized * Add testing * Fix * Fix * Use user workflows instead of directly assigning variables * Enable test of fast tokenizers * Update test of canine tokenizer
-
- 09 Sep, 2021 1 commit
-
-
Nicolas Patry authored
* Moving slow tokenizer to the Trie world. * Adding more docstrings to the Trie. * Fixing doctest (incompatible wiht our format? ) * Update src/transformers/tokenization_utils.py Co-authored-by:
Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Adding a lot more comment into the internals of this algorithm. * Cleaner doc. * Fixing the namings. * Update src/transformers/tokenization_utils.py Co-authored-by:
Lysandre Debut <lysandre@huggingface.co> * quality. * Fixing longest first match. * Small improvements to cuts + more test + canine resistant test. * Fixing fast test. Co-authored-by:
Sylvain Gugger <35901082+sgugger@users.noreply.github.com> Co-authored-by:
Lysandre Debut <lysandre@huggingface.co>
-
- 02 Sep, 2021 1 commit
-
-
Apoorv Garg authored
* correct order of overflowing_tokens for slow tokenizer (issue fix #13148) * python 3.9 requires sentencepiece version 0.1.94 or above * slicing of ids fixed in truncated_sequence() * Update setup.py * Correct order of overflowing tokens for pair of sentences * code reformatted * Update tokenization_utils_base.py * reformatting file * test to check single_input added * missing function restored * test to check pair_input overflowing tokens order * test to check pair_input overflowing tokens order * test to check pair_input overflowing tokens order * added an error message for pair of seq and longest_first strategy * test for pair_input modified * variable name corrected * fixed a typo in error message * requested changes implemented * required test added * Corrected the message to match test message * added error message for Luke Tokenizer * lost test recovered * docstring for truncate_sequences and prepare_for_model updated * docstring for luke tokenizer updated * updated ENCODE_PLUS_ADDITIONAL_KWARGS_DOCSTRING * aligned text and fixed puncuatations * improved style and quality of code * fixed error_msg in truncate_sequences * replaced encode_plus method with regular call method * clean up * rephrased the docstring
-
- 01 Sep, 2021 1 commit
-
-
SaulLu authored
* add test in trainer and test tokenizer saving wi th trainer * quality * reverse trainer changes * replace test in test_trainer by a test for all the tokenizers * format * add can_save_slow_tokenizer attribute to all tokenizers * fix Herbert * format * Change comment in error * add comments and a new assert * Update src/transformers/models/albert/tokenization_albert_fast.py Co-authored-by:
Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * change ValueError barthez * change ValueError BigBird * change ValueError Camembert * change ValueError Mbart50 * change ValueError Pegasus * change ValueError ReFormer * change ValueError T5 * change ValueError RoBERTa * XLNET fast * Update tests/test_tokenization_common.py Co-authored-by:
Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * change `assert` into `self.assertIn` * format Co-authored-by:
Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
-
- 23 Aug, 2021 1 commit
-
-
SaulLu authored
Change how "additional_special_tokens" argument in the ".from_pretrained" method of the tokenizer is taken into account (#13056) * add test * add change in PretrainedTokenizerBase * change Luke * deactivate * add the possibility to add additional special tokens for M2M100 * format * add special test for canine * proposed changes for mbart * proposed changes for mbart50 * proposed changes for byt5 * proposed changes for canine * proposed changes for t5 * test fast and slow * remove comment * remove comment * add fast version for all tests * replace break by continue * add more comments * add check to avoid duplicates * remove comment * format * proposed change for wave2vec2 * reverse changes mbart * uncomment * format
-
- 17 Jul, 2021 1 commit
-
-
Tomohiro Endo authored
* Detect mismatch by analyzing config * Fix comment * Fix import * Update src/transformers/tokenization_utils_base.py Co-authored-by:
SaulLu <55560583+SaulLu@users.noreply.github.com> * Revise based on reviews * remove kwargs * Fix exception * Fix handling exception again * Disable mismatch test in PreTrainedTokenizerFast Co-authored-by:
SaulLu <55560583+SaulLu@users.noreply.github.com>
-
- 16 Jul, 2021 1 commit
-
-
SaulLu authored
* preserve type of `additional_special_tokens` in `special_token_map` * format * Update src/transformers/tokenization_utils_base.py Co-authored-by:
Sylvain Gugger <35901082+sgugger@users.noreply.github.com> Co-authored-by:
Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
-