- 27 Jan, 2022 1 commit
-
-
SaulLu authored
* add new test
* add a feature to save the sentencepiece tokenizer model when the init file was deleted
* update marian
* update m2m_100
* fix marian
* update speech to text
* override test for layoutxlm
* fix saving bartpho
* remove hardcoded values bartpho
* special token string version
* finish bartpho
* override layoutxlm test
* add mbart
* move special tokens list
* format
* Revert "format"
  This reverts commit 37a40df37903a932c2f951cbd33acb684246bae7.
* simplify list of string of special tokens
* Re-write `self.fairseq_tokens_to_ids` initialization logic with special tokens
Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>
-
- 06 Jan, 2022 1 commit
-
-
Nicolas Patry authored
-
- 03 Jan, 2022 1 commit
-
-
Nicolas Patry authored
* Enabling `truncation_side` for Slow and Fast tokenizer.
* Disable failing tests.
* Layout xlm.
* assert -> assertEqual.
Co-authored-by: Niels Rogge <48327001+NielsRogge@users.noreply.github.com>
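The effect of a truncation side can be illustrated with a minimal sketch. The helper `truncate` below is hypothetical and written only for illustration; it is not the library's implementation, and it assumes a single sequence of token ids with no special tokens.

```python
from typing import List


def truncate(ids: List[int], max_length: int, truncation_side: str = "right") -> List[int]:
    """Hypothetical helper: keep at most `max_length` ids from one side."""
    if truncation_side not in ("left", "right"):
        raise ValueError(f"truncation_side must be 'left' or 'right', got {truncation_side!r}")
    if max_length < 0:
        raise ValueError("max_length must be non-negative")
    if len(ids) <= max_length:
        return list(ids)
    if truncation_side == "right":
        # Drop ids from the end, keep the beginning.
        return ids[:max_length]
    # Drop ids from the beginning, keep the end.
    return ids[len(ids) - max_length:]
```

Truncating on the left keeps the most recent tokens, which can matter when the end of the input carries the relevant context.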
-
- 30 Dec, 2021 1 commit
-
-
Nicolas Patry authored
* Fixing a pathological case for slow tokenizers
* Update src/transformers/tokenization_utils.py
-
- 03 Dec, 2021 1 commit
-
-
Li-Huai (Allan) Lin authored
* Use new method to acquire tokenizers
* Resolve TODOs.
* Style
* Fix
* Enable do_lower_case in test_tokenize_special_tokens
* Apply suggestion from code review
* Fix mask token handling
* Revert "Fix mask token handling"
  This reverts commit daaa3f5291b1f71e5bc3604ca281c000000c4648.
* Fix FNet mask token tokenization
* Complete everything
* Apply suggestions from code review
-
- 10 Nov, 2021 1 commit
-
-
Li-Huai (Allan) Lin authored
* Fix index out of range when padding
* Apply suggestions from code review
* Style
-
- 08 Nov, 2021 1 commit
-
-
Sylvain Gugger authored
* Dynamic configs
* Add config test
* Better tests
* Add tokenizer and test
* Add to from_config
* With save
-
- 02 Nov, 2021 1 commit
-
-
Sylvain Gugger authored
* Update Transformers to huggingface_hub >= 0.1.0
* Forgot to save...
* Style
* Fix test
-
- 11 Oct, 2021 1 commit
-
-
Sylvain Gugger authored
* Honor existing attention mask in tokenizer.pad
* Fix initialization of attention mask
* Roll the implem on all subclasses
* Fix tests
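What honoring an existing attention mask during padding means can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's code: the function `pad_right` is a hypothetical name, it pads on the right only, and it handles a single encoded example stored in a plain dict.

```python
from typing import Dict, List


def pad_right(encoded: Dict[str, List[int]], max_length: int, pad_token_id: int = 0) -> Dict[str, List[int]]:
    """Hypothetical helper: pad one example, reusing its attention mask if present."""
    input_ids = list(encoded["input_ids"])
    # Only build a fresh all-ones mask when the caller did not supply one.
    if "attention_mask" in encoded:
        attention_mask = list(encoded["attention_mask"])
    else:
        attention_mask = [1] * len(input_ids)
    difference = max_length - len(input_ids)
    if difference > 0:
        input_ids += [pad_token_id] * difference
        attention_mask += [0] * difference  # padded positions are never attended to
    return {"input_ids": input_ids, "attention_mask": attention_mask}
```

The point of the sketch is the branch on `"attention_mask" in encoded`: zeros the caller already placed in the mask survive padding instead of being overwritten with ones.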
-
- 08 Oct, 2021 1 commit
-
-
Nicolas Patry authored
* Adding support for tokens being suffixes or part of each other.
* Better test name.
-
- 05 Oct, 2021 1 commit
-
-
Nicolas Patry authored
-
- 17 Sep, 2021 1 commit
-
-
Li-Huai (Allan) Lin authored
* Fix special tokens not correctly tokenized
* Add testing
* Fix
* Fix
* Use user workflows instead of directly assigning variables
* Enable test of fast tokenizers
* Update test of canine tokenizer
-
- 09 Sep, 2021 1 commit
-
-
Nicolas Patry authored
* Moving slow tokenizer to the Trie world.
* Adding more docstrings to the Trie.
* Fixing doctest (incompatible with our format?)
* Update src/transformers/tokenization_utils.py
* Adding a lot more comment into the internals of this algorithm.
* Cleaner doc.
* Fixing the namings.
* Update src/transformers/tokenization_utils.py
* quality.
* Fixing longest first match.
* Small improvements to cuts + more test + canine resistant test.
* Fixing fast test.
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
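The idea of splitting text on added tokens with a trie and a longest-first match can be sketched as below. This is a deliberately simplified, greedy leftmost-longest version written for illustration; the class `SimpleTrie` is hypothetical and is not the implementation in `tokenization_utils.py`, which handles partial and overlapping matches differently.

```python
from typing import Dict, List


class SimpleTrie:
    """Hypothetical, simplified trie used to cut text around known tokens."""

    def __init__(self) -> None:
        self.root: Dict[str, dict] = {}

    def add(self, word: str) -> None:
        if not word:
            return
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}  # the empty key marks the end of a stored token

    def split(self, text: str) -> List[str]:
        parts: List[str] = []
        start = 0  # start of the pending plain-text chunk
        i = 0
        while i < len(text):
            node = self.root
            end = -1
            j = i
            # Walk the trie as far as the text allows, remembering the longest match.
            while j < len(text) and text[j] in node:
                node = node[text[j]]
                j += 1
                if "" in node:
                    end = j
            if end == -1:
                i += 1
                continue
            if start < i:
                parts.append(text[start:i])
            parts.append(text[i:end])
            start = i = end
        if start < len(text):
            parts.append(text[start:])
        return parts
```

Because the walk keeps going after the first complete match, a token that is a prefix of another stored token (for example `extra_id_1` and `extra_id_100`) does not cut the longer one in half.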
-
- 02 Sep, 2021 1 commit
-
-
Apoorv Garg authored
* correct order of overflowing_tokens for slow tokenizer (issue fix #13148)
* python 3.9 requires sentencepiece version 0.1.94 or above
* slicing of ids fixed in truncated_sequence()
* Update setup.py
* Correct order of overflowing tokens for pair of sentences
* code reformatted
* Update tokenization_utils_base.py
* reformatting file
* test to check single_input added
* missing function restored
* test to check pair_input overflowing tokens order
* test to check pair_input overflowing tokens order
* test to check pair_input overflowing tokens order
* added an error message for pair of seq and longest_first strategy
* test for pair_input modified
* variable name corrected
* fixed a typo in error message
* requested changes implemented
* required test added
* Corrected the message to match test message
* added error message for Luke Tokenizer
* lost test recovered
* docstring for truncate_sequences and prepare_for_model updated
* docstring for luke tokenizer updated
* updated ENCODE_PLUS_ADDITIONAL_KWARGS_DOCSTRING
* aligned text and fixed punctuation
* improved style and quality of code
* fixed error_msg in truncate_sequences
* replaced encode_plus method with regular call method
* clean up
* rephrased the docstring
-
- 01 Sep, 2021 1 commit
-
-
SaulLu authored
* add test in trainer and test tokenizer saving with trainer
* quality
* reverse trainer changes
* replace test in test_trainer by a test for all the tokenizers
* format
* add can_save_slow_tokenizer attribute to all tokenizers
* fix Herbert
* format
* Change comment in error
* add comments and a new assert
* Update src/transformers/models/albert/tokenization_albert_fast.py
* change ValueError barthez
* change ValueError BigBird
* change ValueError Camembert
* change ValueError Mbart50
* change ValueError Pegasus
* change ValueError ReFormer
* change ValueError T5
* change ValueError RoBERTa
* XLNET fast
* Update tests/test_tokenization_common.py
* change `assert` into `self.assertIn`
* format
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
-
- 23 Aug, 2021 1 commit
-
-
SaulLu authored
Change how "additional_special_tokens" argument in the ".from_pretrained" method of the tokenizer is taken into account (#13056)
* add test
* add change in PretrainedTokenizerBase
* change Luke
* deactivate
* add the possibility to add additional special tokens for M2M100
* format
* add special test for canine
* proposed changes for mbart
* proposed changes for mbart50
* proposed changes for byt5
* proposed changes for canine
* proposed changes for t5
* test fast and slow
* remove comment
* remove comment
* add fast version for all tests
* replace break by continue
* add more comments
* add check to avoid duplicates
* remove comment
* format
* proposed change for wav2vec2
* reverse changes mbart
* uncomment
* format
-
- 17 Jul, 2021 1 commit
-
-
Tomohiro Endo authored
* Detect mismatch by analyzing config
* Fix comment
* Fix import
* Update src/transformers/tokenization_utils_base.py
* Revise based on reviews
* remove kwargs
* Fix exception
* Fix handling exception again
* Disable mismatch test in PreTrainedTokenizerFast
Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>
-
- 16 Jul, 2021 1 commit
-
-
SaulLu authored
* preserve type of `additional_special_tokens` in `special_token_map`
* format
* Update src/transformers/tokenization_utils_base.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
-
- 01 Jul, 2021 1 commit
-
-
SaulLu authored
* add a test for a WordLevel tokenizer
* adapt common test to new tokenizer
-
- 29 Jun, 2021 1 commit
-
-
Sylvain Gugger authored
* [WIP] Easily train a new fast tokenizer from a given one
* Fix test
* Roll out to other tokenizers and add tests
* Fix bug with unk id and add emoji to test
* Really use something different in test
* Implement special tokens map
* Map special tokens in the Transformers tokenizers
* Fix test
* Make test more robust
* Fix test for BPE
* More robust map and test
  Co-authored-by SaulLu
* Test file
* Stronger tests
* Map unk token for Wordpiece and address review comment
* Fix lowercase test and address review comment
* Fix all tests
* Simplify test
* Fix tests for realsies
* Easily train a new fast tokenizer from a given one - tackle the special tokens format (str or AddedToken) (#12420)
* Propose change in tests regarding lower case
* add new test for special tokens types
* put back the test part about decoding
* add feature: the AddedToken is re-built with the different mapped content
* Address review comment: simplify AddedToken building
* Update src/transformers/tokenization_utils_fast.py
Co-authored-by: sgugger <sylvain.gugger@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: SaulLu <lucilesaul.com@gmail.com>
Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>
-
- 23 Jun, 2021 1 commit
-
-
Sylvain Gugger authored
* Clean push to hub API
* Create working dir if it does not exist
* Different tweak
* New API + all models + test Flax
* Adds the Trainer clean up
* Update src/transformers/file_utils.py
* Address review comments
* (nit) output types
* No need to set clone_from when folder exists
* Update src/transformers/trainer.py
* Add generated_from_trainer tag
* Update to new version
* Fixes
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Julien Chaumond <julien@huggingface.co>
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
-
- 14 Jun, 2021 1 commit
-
-
SaulLu authored
* feature for tokenizer without slow/legacy version
* format
* modify common test
* add tests
* add PreTrainedTokenizerFast to AutoTokenizer
* format
* change tokenizer common test in order to be able to run test without a slow version
* update tokenizer fast test in order to use `rust_tokenizer_class` attribute instead of `tokenizer_class`
* add autotokenizer test
* replace `if self.tokenizer_class is not None` with `if self.tokenizer_class is None`
* remove obsolete change in comment
* Update src/transformers/tokenization_utils_base.py
* Update src/transformers/tokenization_utils_fast.py
* change `get_main_tokenizer` into `get_tokenizers`
* clarify `get_tokenizers` method
* homogenize with `test_slow_tokenizer` and `test_rust_tokenizer`
* add `test_rust_tokenizer = False` to tokenizers which don't define a fast version
* `test_rust_tokenizer = False` for BertJapaneseTokenizer
* `test_rust_tokenizer = False` for BertJapaneseCharacterTokenizationTest
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
-
- 07 Jun, 2021 1 commit
-
-
Philip May authored
-
- 01 Jun, 2021 1 commit
-
-
Philip May authored
* add test_vocab_size for sentencepiece tok.
* add test_get_vocab for sentencepiece tok.
* add test_convert_token_and_id for sentencepiece tok.
* add test_tokenize_and_convert_tokens_to_string for all tok.
* improve test_tokenize_and_convert_tokens_to_string for sp. tok.
* add common tokenizer integration tests
  - for albert
  - for barthez
* add tokenizer integration tests to bert gen.
* add most tokenizer integration tests
* fix camembert tokenizer integration test
* add tokenizer integration test to marian
* add tokenizer integration test to reformer
* add typing and doc to tokenizer_integration_test_util
* fix tokenizer integration test of reformer
* improve test_sentencepiece_tokenize_and_convert_tokens_to_string
* empty commit to trigger CI
* fix tokenizer integration test of reformer
* remove code not needed anymore
* empty commit to trigger CI
* empty commit to trigger CI
-
- 13 May, 2021 1 commit
-
-
Philip May authored
* improve slow class tok usage at xlm rob
* add subword regularization for barthez
* improve barthez tok. test
* fix tokenizer tests
* add subword regularization for camembert
* add subword regularization for deberta v2 tokenizer
* add more doc to deberta v2 tokenizer
* add subword regularization for speech to text tok.
* fix sp_model_kwargs type in speech 2 text tok.
* add subword regularization for M2M100 tok.
* add more concrete type hints
* fix tests for m2m100 and s2t tok.
* add missing Any import
* fix syntax error in m2m100 tok.
* fix unpickle of m2m100 and s2t tok.
* fix test of m2m100 and s2t tok.
* improve unpickle of deberta v2 tok.
* add test for pickle of barthez & camembert
* fix pickle of barthez & camembert
* add test for deberta v2 tok. pickle
* fix m2m100 tok. pickle
* fix s2t tok. pickle
* add subword regularization to albert tok.
* refactor subword reg. test into TokenizerTesterMixin
  improve albert tok. test
  remove sample argument from albert tok.
  check subword reg. using TokenizerTesterMixin
  improve tok. tests
  improve xlm roberta tok. tests
  improve xlm roberta tok. tests
* add subword regularization for big bird t.
* improve xlm roberta tok. test
* add subword regularization for mbart50 tok.
* add subword regularization for pegasus tok.
* add subword regularization for reformer tok.
* add subword regularization for T5 tok.
* fix t5 tok. test formatting
* add subword regularization for xlm_proph. tok.
* add subword regularization for xlnet tok.
* add subword regularization for bert_gen tok.
* add typing to tokenizers
* add typing to xlm rob. tok
* add subword regularization for marian tok.
* add reverse tok. test
* fix marian tok test
* fix marian tok test
* fix casing in tok. tests
* fix style of tok. common test
* fix deberta v2 tok test
* add type annotations to tok. tests
* add type annotations to tok. __init__
* add typing to tokenizer
* add type annotations to tok. __init__
* don't specify the default when it's None
* fix barthez tok. doc
* move sentencepiece tok. tests to TokenizerTesterMixin
* fix unused imports
* fix albert tok. test
* add comment to sentencepiece test options
* fix Any import at big bird tok.
* fix Any import at xlm prophetnet tok.
* empty commit to trigger CI
-
- 04 May, 2021 1 commit
-
-
Lysandre Debut authored
* Fix tests
* Reorganize
* Update tests/test_modeling_mobilebert.py
* Remove unnecessary addition
-
- 26 Apr, 2021 2 commits
-
-
Sylvain Gugger authored
-
Patrick von Platen authored
-
- 23 Apr, 2021 1 commit
-
-
Sylvain Gugger authored
* Initial support for upload to hub
* push -> upload
* Fixes + examples
* Fix torchhub test
* Torchhub test I hate you
* push_model_to_hub -> push_to_hub
* Apply mixin to other pretrained models
* Remove ABC inheritance
* Add tests
* Typo
* Run tests
* Install git-lfs
* Change approach
* Add push_to_hub to all
* Staging test suite
* Typo
* Maybe like this?
* More deps
* Cache
* Adapt name
* Quality
* MOAR tests
* Put it in testing_utils
* Docs + torchhub last hope
* Styling
* Wrong method
* Typos
* Update src/transformers/file_utils.py
* Address review comments
* Apply suggestions from code review
Co-authored-by: Julien Chaumond <julien@huggingface.co>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
-
- 15 Apr, 2021 1 commit
-
-
Sylvain Gugger authored
* Save fast tokenizers in both formats
* Fix for HerBERT
* Proper fix
* Properly test new behavior
-
- 05 Apr, 2021 1 commit
-
-
Lysandre Debut authored
-
- 31 Mar, 2021 1 commit
-
-
Sylvain Gugger authored
* First third
* Styling and fix mistake
* Quality
* All the rest
* Treat %s and %d
* typo
* Missing )
* Apply suggestions from code review
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
-
- 16 Mar, 2021 1 commit
-
-
Patrick von Platen authored
* make flax tests pytorch independent
* fix typo
* finish
* improve circle ci
* fix return tensors
* correct flax test
* re-add sentencepiece
* last tokenizer fixes
* finish maybe now
-
- 25 Feb, 2021 1 commit
-
-
Sylvain Gugger authored
* Make Barthez tokenizer tests a bit faster
* Quality
-
- 02 Feb, 2021 1 commit
-
-
Patrick von Platen authored
* change tokenizer requirement
* split line
* Correct typo from list to str
* improve style
* make other function pretty as well
* add comment
* correct typo
* add new test
* pass tests for tok without padding token
* Apply suggestions from code review
-
- 14 Jan, 2021 1 commit
-
-
Lysandre Debut authored
-
- 12 Jan, 2021 1 commit
-
-
Sylvain Gugger authored
* Add target contextmanager and rework prepare_seq2seq_batch
* Fix tests, treat BART and Barthez
* Add last tokenizers
* Fix test
* Set src token before calling the superclass
* Remove special behavior for T5
* Remove needless imports
* Remove needless asserts
-
- 15 Dec, 2020 1 commit
-
-
NielsRogge authored
* First commit: adding all files from tapas_v3
* Fix multiple bugs including soft dependency and new structure of the library
* Improve testing by adding torch_device to inputs and adding dependency on scatter
* Use Python 3 inheritance rather than Python 2
* First draft model cards of base sized models
* Remove model cards as they are already on the hub
* Fix multiple bugs with integration tests
* All model integration tests pass
* Remove print statement
* Add test for convert_logits_to_predictions method of TapasTokenizer
* Incorporate suggestions by Google authors
* Fix remaining tests
* Change position embeddings sizes to 512 instead of 1024
* Comment out positional embedding sizes
* Update PRETRAINED_VOCAB_FILES_MAP and PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
* Added more model names
* Fix truncation when no max length is specified
* Disable torchscript test
* Make style & make quality
* Quality
* Address CI needs
* Test the Masked LM model
* Fix the masked LM model
* Truncate when overflowing
* More much needed docs improvements
* Fix some URLs
* Some more docs improvements
* Test PyTorch scatter
* Set to slow + minify
* Calm flake8 down
* Add add_pooling_layer argument to TapasModel
  Fix comments by @sgugger and @patrickvonplaten
* Fix issue in docs + fix style and quality
* Clean up conversion script and add task parameter to TapasConfig
* Revert the task parameter of TapasConfig
  Some minor fixes
* Improve conversion script and add test for absolute position embeddings
* Improve conversion script and add test for absolute position embeddings
* Fix bug with reset_position_index_per_cell arg of the conversion cli
* Add notebooks to the examples directory and fix style and quality
* Apply suggestions from code review
* Move from `nielsr/` to `google/` namespace
* Apply Sylvain's comments
Co-authored-by: sgugger <sylvain.gugger@gmail.com>
Co-authored-by: Rogge Niels <niels.rogge@howest.be>
Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
-
- 02 Dec, 2020 1 commit
-
-
Nicolas Patry authored
* Warning about too long input for fast tokenizers too
  If truncation is not set in tokenizers, but the tokenization is too long for the model (`model_max_length`), we used to trigger a warning that the input would probably fail (which it most likely will). This PR re-enables the warning for fast tokenizers too and uses common code for the trigger to make sure it's consistent across slow and fast tokenizers.
* Checking for pair of inputs too.
* Making the function private and adding its doc.
* Remove formatting ?? in odd place.
* Missed uppercase.
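The kind of shared check described here can be sketched as follows. The function name `warn_if_too_long` is hypothetical and the message text is invented for illustration; the sketch only assumes a list of token ids and a `model_max_length` value, and does not reproduce the library's private helper.

```python
import warnings
from typing import List


def warn_if_too_long(ids: List[int], model_max_length: int) -> bool:
    """Hypothetical helper: warn when an untruncated encoding exceeds the model limit."""
    if len(ids) > model_max_length:
        warnings.warn(
            f"Token sequence length ({len(ids)}) is longer than the maximum "
            f"length for this model ({model_max_length}); running it through "
            "the model will probably fail."
        )
        return True
    return False
```

Keeping the check in one function that both the slow and fast code paths call is what makes the behavior consistent between them.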
-
- 17 Nov, 2020 1 commit
-
-
Lysandre Debut authored
* Tokenizers should be framework agnostic
* Run the slow tests
* Not testing
* Fix documentation
* Apply suggestions from code review
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
-