- 24 Apr, 2023 1 commit
-
-
Lucain authored
* Test hf_hub 0.14.0rc1 * fix mocked tests * package version --------- Co-authored-by:
Sylvain Gugger <Sylvain.gugger@gmail.com> Co-authored-by:
testbot <lucainp@hf.co>
-
- 03 Apr, 2023 1 commit
-
-
Arthur authored
* draft * update tokenization limma and conversion script * more updates * initial commit * style * default pad to None * draft tokenization tests * update test * update tokenization tests * nits * update * versioning test * major fix * fix more tests * finish fixing special masks * last nit * more nits * add encode decode tests * add more * fix token type ids * style
-
- 29 Mar, 2023 1 commit
-
-
Arthur authored
* add draft changes * fix failing wav2vec * style * make sure that the argument is saved + add tests * style * fixup * update test * default clean_up_tokenization_spaces to False for Bloom and Llama * Update code based on review Co-authored-by:
Nicolas Patry <patry.nicolas@gmail.com> * style * quality --------- Co-authored-by:
Nicolas Patry <patry.nicolas@gmail.com>
-
- 09 Mar, 2023 1 commit
-
-
Lucain authored
* Remove set_access_token usage + fail tests if FutureWarning * do not fail on FutureWarning in CI --------- Co-authored-by: testbot <lucainp@hf.co>
-
- 07 Feb, 2023 1 commit
-
-
Sylvain Gugger authored
* Remove mentions of flake8/isort * Clean up inits * Deal with all other inits * Last special rule for dummy files
-
- 06 Feb, 2023 1 commit
-
-
Sylvain Gugger authored
* Result of black 23.1 * Update target to Python 3.7 * Switch flake8 to ruff * Configure isort * Configure isort * Apply isort with line limit * Put the right black version * adapt black in check copies * Fix copies
-
- 02 Nov, 2022 1 commit
-
-
Ben Eyal authored
🚨 🚨 🚨 Fix Issue 15003: SentencePiece Tokenizers Not Adding Special Tokens in `convert_tokens_to_string` (#15775) * Add test for SentencePiece not adding special tokens to strings * Add SentencePieceStringConversionMixin to fix issue 15003 * Fix conversion from tokens to string for most SentencePiece tokenizers Tokenizers fixed: - AlbertTokenizer - BarthezTokenizer - CamembertTokenizer - FNetTokenizer - M2M100Tokenizer - MBart50Tokenizer - PegasusTokenizer - Speech2TextTokenizer * Fix MarianTokenizer, adjust SentencePiece test to accommodate vocab * Fix DebertaV2Tokenizer * Ignore LayoutXLMTokenizer in SentencePiece string conversion test * Run 'make style' and 'make quality' * Clean convert_tokens_to_string test Instead of explicitly ignoring LayoutXLMTokenizer in the test, override the test in LayoutLMTokenizationTest and do nothing in it. * Remove commented out code * Improve robustness of convert_tokens_to_string test Instead of comparing lengths of re-tokenized text and input_ids, check that converting all special tokens to string yields a string with all special tokens. * Inline and remove SentencePieceStringConversionMixin The convert_tokens_to_string method is now implemented in each relevant SentencePiece tokenizer.
* Run 'make style' and 'make quality' * Revert removal of space in convert_tokens_to_string * Remove redundant import * Revert test text to original * Uncomment the lowercasing of the reverse_text variable * Mimic Rust tokenizer behavior for tokenizers - Albert - Barthez - Camembert - MBart50 - T5 * Fix accidentally skipping test in wrong tokenizer * Add test for equivalent Rust and slow tokenizer behavior * Override _decode in BigBirdTokenizer to mimic Rust behavior * Override _decode in FNetTokenizer to mimic Rust behavior * Override _decode in XLNetTokenizer to mimic Rust behavior * Remove unused 're' import * Update DebertaV2Tokenizer to mimic Rust tokenizer * Deberta tokenizer now behaves like Albert and its `convert_tokens_to_string` is not tested. * Ignore problematic tests in Deberta V2 * Add comment on why the Deberta V2 tests are skipped
-
- 25 Oct, 2022 1 commit
-
-
Yih-Dar authored
* Fix model-tokenizer mapping Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
-
- 14 Oct, 2022 1 commit
-
-
Sylvain Gugger authored
-
- 27 Sep, 2022 1 commit
-
-
Sylvain Gugger authored
* More tests for regression in cached non existence * Style
-
- 16 Sep, 2022 2 commits
-
-
Sylvain Gugger authored
-
Sylvain Gugger authored
* Fix tokenizer load from one file * Add a test * Style Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
-
- 15 Sep, 2022 1 commit
-
-
Sylvain Gugger authored
* Fix CI for custom tokenizers * Add nightly tests * Run CI, run! * Fix paths * Typos * Fix test
-
- 29 Aug, 2022 1 commit
-
-
Lucain authored
-
- 24 Aug, 2022 1 commit
-
-
SaulLu authored
add warning to let the user know that the `__call__` method is faster than `encode` + `pad` for a fast tokenizer (#18693) * add warning to let the user know that the method is slower that for a fast tokenizer * user warnings * fix layoutlmv2 * fix layout* * change warnings into logger.warning
-
- 05 Aug, 2022 1 commit
-
-
Sylvain Gugger authored
* Draft new cached_file * Initial draft for config and model * Small fixes * Fix first batch of tests * Look in cache when internet is down * Fix last tests * Bad black, not fixing all quality errors * Make diff less * Implement change for TF and Flax models * Add tokenizer and feature extractor * For compatibility with main * Add utils to move the cache and auto-do it at first use. * Quality * Deal with empty commit shas * Deal with empty etag * Address review comments
-
- 01 Aug, 2022 1 commit
-
-
Sylvain Gugger authored
* Rewrite push_to_hub to use upload_files * Adapt the doc a bit * Address review comments and clean doc
-
- 11 Jul, 2022 1 commit
-
-
Yulv-git authored
* Fix some typos. Signed-off-by:
Yulv-git <yulvchi@qq.com> * Fix typo. Signed-off-by:
Yulv-git <yulvchi@qq.com> * make fixup.
-
- 23 Jun, 2022 1 commit
-
-
Guillaume Klein authored
Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>
-
- 21 Jun, 2022 1 commit
-
-
Lysandre Debut authored
* Prepare CI for v0.8.0 * pin hfh (revert before merge) * Revert "pin hfh (revert before merge)" This reverts commit a0103140e1c77b810ffcb735192968bc03be3e1f. * Test rc3 * Test latest rc * Unpin to the RC Co-authored-by: Sylvain Gugger <Sylvain.gugger@gmail.com>
-
- 31 May, 2022 1 commit
-
-
Patrick von Platen authored
[Json configs] Make json prettier for all saved tokenizer files & ensure same json format for all processors (tok + feat_extract) (#17457) * [Json dump] Make json prettier * correct more tokenizers * more patterns * add aggressive test * the aggressive test was actually useful :-) * more tests * Apply suggestions from code review
-
- 12 May, 2022 1 commit
-
-
Sylvain Gugger authored
* Black preview * Fixup too! * Fix check copies * Use the same version as the CI * Bump black
-
- 13 Apr, 2022 1 commit
-
-
davidleonfdez authored
* Fix setters of *_token_id properties of SpecialTokensMixin * Test setters of common tokens ids * Move to a separate test checks of setters of tokens ids * Add independent test for ByT5 * Add Canine test * Test speech to text
-
- 04 Apr, 2022 1 commit
-
-
SaulLu authored
* add new tests * add comment to overridden tests
-
- 23 Mar, 2022 1 commit
-
-
Sylvain Gugger authored
* Make Transformers use cache files when hf.co is down * Fix tests * Was there a random circleCI failure? * Isolate patches * Style * Comment out the failure since it doesn't fail anymore * Better comment
-
- 15 Feb, 2022 1 commit
-
-
Sylvain Gugger authored
* Allow custom code for Processors * Add more test * Test all auto_map configs are properly set
-
- 02 Feb, 2022 2 commits
-
-
SaulLu authored
* change truncation_side in init of `PreTrainedTokenizerBase` Co-authored-by:
LSinev <LSinev@users.noreply.github.com> * add test * Revert "replace assert with exception for `padding_side` arg in `PreTrainedTokenizerBase` `__init__`" This reverts commit 7a98b87962d2635c7e4d4f00db3948b694624843. * fix kwargs * Revert "fix kwargs" This reverts commit 67b0a5270e8cf1dbf70e6b0232e94c0452b6946f. * Update tests/test_tokenization_common.py Co-authored-by:
Nicolas Patry <patry.nicolas@protonmail.com> * delete truncation_side variable * reorganize test * format * complete doc * Revert "Revert "replace assert with exception for `padding_side` arg in `PreTrainedTokenizerBase` `__init__`"" This reverts commit d5a10a7e2680539e5d9e98ae5d896c893d224b80. * fix typo * fix typos to render documentation * Revert "Revert "Revert "replace assert with exception for `padding_side` arg in `PreTrainedTokenizerBase` `__init__`""" This reverts commit 16cf58811943a08f43409a7c83eaa330686591d0. * format Co-authored-by:
LSinev <LSinev@users.noreply.github.com> Co-authored-by:
Nicolas Patry <patry.nicolas@protonmail.com>
-
Sylvain Gugger authored
* Allow dynamic modules to use relative imports * Work for configs * Fix last merge conflict * Save code of registered custom objects * Map strings to strings * Fix test * Add tokenizer * Rework tests * Tests * Ignore fixtures py files for tests * Tokenizer test + fix collection * With full path * Rework integration * Fix typo * Remove changes in conftest * Test for tokenizers * Add documentation * Update docs/source/custom_models.mdx Co-authored-by:
Lysandre Debut <lysandre@huggingface.co> * Add file structure and file content * Add more doc * Style * Update docs/source/custom_models.mdx Co-authored-by:
Suraj Patil <surajp815@gmail.com> * Address review comments Co-authored-by:
Lysandre Debut <lysandre@huggingface.co> Co-authored-by:
Suraj Patil <surajp815@gmail.com>
-
- 01 Feb, 2022 2 commits
-
-
SaulLu authored
fix the `tokenizer_config.json` file for the slow tokenizer when a fast version is available (#15319) * add new test * update test * remove `tokenizer_file` from `additional_files_names` in `tokenization_utils_base.py` * add `tokenizer_file` for the fast only tokenizer * change global variables layoutxml * remove `"tokenizer_file"` from DPR tokenizer's Global variables * remove `tokenizer_file` from herbert slow tokenizer init * `"tokenizer_file"` from LED tokenizer's Global variables * remove `tokenizer_file` from mbart slow tokenizer init * remove `tokenizer_file` from slow tokenizer template * adapt to versioning * adapt the `test_tokenizer_mismatch_warning` test * clean test * clarify `VOCAB_FILES_NAMES` in tokenization_utils_fast.py * Revert "remove `tokenizer_file` from mbart slow tokenizer init" This reverts commit 0dbb723fa9c7599d4640fe30b3647a74eb4a64e1. * Revert "`"tokenizer_file"` from LED tokenizer's Global variables" This reverts commit 5a3f879bdd651233f3d74a3d1146c34cde82b0c2. * Revert "remove `tokenizer_file` from herbert slow tokenizer init" This reverts commit f5e10007b7b0ec5345e015b9de7ffec72c5407fd. * Revert "remove `"tokenizer_file"` from DPR tokenizer's Global variables" This reverts commit da0895330bedfafc81ae3073470a9348c669f032. * set `tokenizer_file` in super `__init__` of mbart
-
SaulLu authored
* replace assert with exception for `padding_side` arg in `PreTrainedTokenizerBase` `__init__` * add test * fix kwargs * reformat test * format * format * fix typo to render the documentation
-
- 27 Jan, 2022 1 commit
-
-
SaulLu authored
* add new test * add a feature to save the sentencepiece tokenizer model when the init file was deleted * update marian * update m2m_100 * fix marian * update speech to text * override test for layoutxlm * fix saving bartpho * remove hardcoded values bartpho * special token string version * finish bartpho * override layoutxml test * add mbart * move special tokens list * format * Revert "format" This reverts commit 37a40df37903a932c2f951cbd33acb684246bae7. * simplify list of string of special tokens * Re-write `self.fairseq_tokens_to_ids ` initialization logic with special tokens Co-authored-by:
Sylvain Gugger <sylvain.gugger@gmail.com> Co-authored-by:
Sylvain Gugger <sylvain.gugger@gmail.com>
-
- 06 Jan, 2022 1 commit
-
-
Nicolas Patry authored
-
- 03 Jan, 2022 1 commit
-
-
Nicolas Patry authored
* Enabling `truncation_side` for Slow and Fast tokenizer. Co-Authored-by:
Niels Rogge <48327001+NielsRogge@users.noreply.github.com> * Disable failing tests. * Layout xlm. * assert -> assertEqual. Co-authored-by:
Niels Rogge <48327001+NielsRogge@users.noreply.github.com>
-
- 30 Dec, 2021 1 commit
-
-
Nicolas Patry authored
* Fixing a pathological case for slow tokenizers * Update src/transformers/tokenization_utils.py
-
- 03 Dec, 2021 1 commit
-
-
Li-Huai (Allan) Lin authored
* Use new method to acquire tokenizers * Resolve TODOs. * Style * Fix * Enable do_lower_case in test_tokenize_special_tokens * Apply suggestion from code review * Fix mask token handling * Revert "Fix mask token handling" This reverts commit daaa3f5291b1f71e5bc3604ca281c000000c4648. * Fix FNet mask token tokenization * Complete everything * Apply suggestions from code review
-
- 10 Nov, 2021 1 commit
-
-
Li-Huai (Allan) Lin authored
* Fix index out of range when padding * Apply suggestions from code review * Style
-
- 08 Nov, 2021 1 commit
-
-
Sylvain Gugger authored
* Dynamic configs * Add config test * Better tests * Add tokenizer and test * Add to from_config * With save
-
- 02 Nov, 2021 1 commit
-
-
Sylvain Gugger authored
* Update Transformers to huggingface_hub >= 0.1.0 * Forgot to save... * Style * Fix test
-
- 11 Oct, 2021 1 commit
-
-
Sylvain Gugger authored
* Honor existing attention mask in tokenizer.pad * Fix initialization of attention mask * Roll the implem on all subclasses * Fix tests
-
- 08 Oct, 2021 1 commit
-
-
Nicolas Patry authored
* Adding support for tokens being suffixes or part of each other. * Better test name.
-