- 02 Feb, 2022 1 commit
-
-
Sylvain Gugger authored
* Allow dynamic modules to use relative imports * Work for configs * Fix last merge conflict * Save code of registered custom objects * Map strings to strings * Fix test * Add tokenizer * Rework tests * Tests * Ignore fixtures py files for tests * Tokenizer test + fix collection * With full path * Rework integration * Fix typo * Remove changes in conftest * Test for tokenizers * Add documentation * Update docs/source/custom_models.mdx Co-authored-by:
Lysandre Debut <lysandre@huggingface.co> * Add file structure and file content * Add more doc * Style * Update docs/source/custom_models.mdx Co-authored-by:
Suraj Patil <surajp815@gmail.com> * Address review comments Co-authored-by:
Lysandre Debut <lysandre@huggingface.co> Co-authored-by:
Suraj Patil <surajp815@gmail.com>
-
- 21 Jan, 2022 1 commit
-
-
Sylvain Gugger authored
* Refine errors for pretrained objects * PoC to avoid using get_list_of_files * Adapt tests to use new errors * Quality + Fix PoC * Revert "PoC to avoid using get_list_of_files" This reverts commit cb93b7cae8504ef837c2a7663cb7955e714f323e. * Revert "Quality + Fix PoC" This reverts commit 3ba6d0d4ca546708b31d355baa9e68ba9736508f. * Fix doc * Revert PoC * Add feature extractors * More tests and PT model * Adapt error message * Feature extractor tests * TF model * Flax model and test * Merge flax auto tests * Add tokenization * Fix test
-
- 03 Jan, 2022 1 commit
-
-
Nicolas Patry authored
* Enabling `truncation_side` for Slow and Fast tokenizer. Co-Authored-by:
Niels Rogge <48327001+NielsRogge@users.noreply.github.com> * Disable failing tests. * Layout xlm. * assert -> assertEqual. Co-authored-by:
Niels Rogge <48327001+NielsRogge@users.noreply.github.com>
-
- 23 Dec, 2021 1 commit
-
-
Sylvain Gugger authored
* Better logic for getting tokenizer config in AutoTokenizer * Remove needless import * Remove debug statement * Address review comments
-
- 18 Oct, 2021 1 commit
-
-
Sylvain Gugger authored
* Add API to register a new object in auto classes * Fix test * Documentation * Add to tokenizers and test * Add cleanup after tests * Be more careful * Move import * Move import * Cleanup in TF test too * Add consistency check * Add documentation * Style * Update docs/source/model_doc/auto.rst Co-authored-by:
Lysandre Debut <lysandre@huggingface.co> * Update src/transformers/models/auto/auto_factory.py Co-authored-by:
Lysandre Debut <lysandre@huggingface.co> Co-authored-by:
Lysandre Debut <lysandre@huggingface.co>
-
- 21 Sep, 2021 2 commits
-
-
Patrick von Platen authored
* up * up
-
Kamal Raj authored
-
- 30 Aug, 2021 1 commit
-
-
Sylvain Gugger authored
* Fix AutoTokenizer when a tokenizer has no fast version * Add test
-
- 26 Aug, 2021 1 commit
-
-
Stas Bekman authored
* fix tokenizer_class_from_name * Update src/transformers/models/auto/tokenization_auto.py Co-authored-by:
Lysandre Debut <lysandre@huggingface.co> * add test Co-authored-by:
Lysandre Debut <lysandre@huggingface.co>
-
- 17 Jun, 2021 1 commit
-
-
Sylvain Gugger authored
* AutoTokenizer: infer the class from the tokenizer config if possible * Add tests * Update src/transformers/models/auto/tokenization_auto.py Co-authored-by:
Patrick von Platen <patrick.v.platen@gmail.com> Co-authored-by:
Patrick von Platen <patrick.v.platen@gmail.com>
-
- 14 Jun, 2021 1 commit
-
-
SaulLu authored
* feature for tokenizer without slow/legacy version * format * modify common test * add tests * add PreTrainedTokenizerFast to AutoTokenizer * format * change tokenizer common test in order to be able to run test without a slow version * update tokenizer fast test in order to use `rust_tokenizer_class` attribute instead of `tokenizer_class` * add AutoTokenizer test * replace `if self.tokenizer_class is not None` with `if self.tokenizer_class is None` * remove obsolete change in comment * Update src/transformers/tokenization_utils_base.py Co-authored-by:
Lysandre Debut <lysandre@huggingface.co> * Update src/transformers/tokenization_utils_fast.py Co-authored-by:
Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * change `get_main_tokenizer` into `get_tokenizers` * clarify `get_tokenizers` method * homogenize with `test_slow_tokenizer` and `test_rust_tokenizer` * add `test_rust_tokenizer = False` to tokenizer which don't define a fast version * `test_rust_tokenizer = False` for BertJapaneseTokenizer * `test_rust_tokenizer = False` for BertJapaneseCharacterTokenizationTest Co-authored-by:
Lysandre Debut <lysandre@huggingface.co> Co-authored-by:
Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
-
- 31 Mar, 2021 1 commit
-
-
Sylvain Gugger authored
* First third * Styling and fix mistake * Quality * All the rest * Treat %s and %d * typo * Missing ) * Apply suggestions from code review Co-authored-by:
Lysandre Debut <lysandre@huggingface.co> Co-authored-by:
Lysandre Debut <lysandre@huggingface.co>
-
- 19 Mar, 2021 1 commit
-
-
Théo Matussière authored
* fix backend tokenizer args override: key mismatch * no touching the docs * fix mpnet * add mpnet to test * fix test Co-authored-by: theo <theo@matussie.re>
-
- 07 Dec, 2020 1 commit
-
-
Sylvain Gugger authored
* Add copyright everywhere missing * Style
-
- 24 Nov, 2020 1 commit
-
-
Lysandre Debut authored
* MT5 should have an autotokenizer * Different configurations should be able to point to same tokenizers
-
- 17 Nov, 2020 2 commits
-
-
Sylvain Gugger authored
* Remove old deprecated arguments Co-authored-by:
LysandreJik <lysandre.debut@reseau.eseo.fr> * Remove needless imports * Fix tests Co-authored-by:
LysandreJik <lysandre.debut@reseau.eseo.fr>
-
Sylvain Gugger authored
* Put models in subfolders * Styling * Fix imports in tests * More fixes in test imports * Sneaky hidden imports * Fix imports in doc files * More sneaky imports * Finish fixing tests * Fix examples * Fix path for copies * More fixes for examples * Fix dummy files * More fixes for example * More model import fixes * Is this why you're unhappy GitHub? * Fix imports in convert command
-
- 15 Nov, 2020 1 commit
-
-
Thomas Wolf authored
[breaking|pipelines|tokenizers] Adding slow-fast tokenizers equivalence tests pipelines - Removing sentencepiece as a required dependency (#8073) * Fixing roberta for slow-fast tests * WIP getting equivalence on pipelines * slow-to-fast equivalence - working on question-answering pipeline * optional FAISS tests * Pipeline Q&A * Move pipeline tests to their own test job again * update tokenizer to add sequence id methods * update to tokenizers 0.9.4 * set sentencepiece as optional * clean up squad * clean up pipelines to use sequence_ids * style/quality * wording * Switch to use_fast = True by default * update tests for use_fast at True by default * fix rag tokenizer test * removing protobuf from required dependencies * fix NER test for use_fast = True by default * fixing example tests (Q&A examples use slow tokenizers for now) * protobuf in main deps extras["sentencepiece"] and example deps * fix protobuf install test * try to fix seq2seq by switching to slow tokenizers for now * Update src/transformers/tokenization_utils_base.py Co-authored-by:
Lysandre Debut <lysandre@huggingface.co> * Update src/transformers/tokenization_utils_base.py Co-authored-by:
Lysandre Debut <lysandre@huggingface.co> Co-authored-by:
Lysandre Debut <lysandre@huggingface.co>
-
- 22 Oct, 2020 1 commit
-
-
Stas Bekman authored
* slow tests should be slow * exception note * style * integrate LysandreJik's notes with some expansions * Apply suggestions from code review Co-authored-by:
Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * another slow test * fix link, and prose * clarify. * note from Sam * typo Co-authored-by:
Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
-
- 18 Oct, 2020 1 commit
-
-
Thomas Wolf authored
* splitting fast and slow tokenizers [WIP] * [WIP] splitting sentencepiece and tokenizers dependencies * update dummy objects * add name_or_path to models and tokenizers * prefix added to file names * prefix * styling + quality * splitting all the tokenizer files - sorting sentencepiece based ones * update tokenizer version up to 0.9.0 * remove hard dependency on sentencepiece 🎉 * and removed hard dependency on tokenizers 🎉 * update conversion script * update missing models * fixing tests * move test_tokenization_fast to main tokenization tests - fix bugs * bump up tokenizers * fix bert_generation * update and fix several tokenizers * keep sentencepiece in deps for now * fix funnel and deberta tests * fix fsmt * fix marian tests * fix layoutlm * fix squeezebert and gpt2 * fix T5 tokenization * fix xlnet tests * style * fix mbart * bump up tokenizers to 0.9.2 * fix model tests * fix tf models * fix seq2seq examples * fix tests without sentencepiece * fix slow => fast conversion without sentencepiece * update auto and bert generation tests * fix mbart tests * fix auto and common test without tokenizers * fix tests without tokenizers * clean up tests lighten up when tokenizers + sentencepiece are both off * style quality and tests fixing * add sentencepiece to doc/examples reqs * leave sentencepiece on for now * style quality split hebert and fix pegasus * WIP Herbert fast * add sample_text_no_unicode and fix hebert tokenization * skip FSMT example test for now * fix style * fix fsmt in example tests * update following Lysandre and Sylvain's comments * Update src/transformers/testing_utils.py Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Update src/transformers/testing_utils.py Co-authored-by:
Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Update src/transformers/tokenization_utils_base.py Co-authored-by:
Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Update src/transformers/tokenization_utils_base.py Co-authored-by:
Sylvain Gugger <35901082+sgugger@users.noreply.github.com> Co-authored-by:
Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
-
- 09 Sep, 2020 1 commit
-
-
Julien Chaumond authored
-
- 01 Jul, 2020 1 commit
-
-
Sam Shleifer authored
-
- 25 Jun, 2020 1 commit
-
-
Thomas Wolf authored
[Tokenization] Fix #5181 - make #5155 more explicit - move back the default logging level in tests to WARNING (#5252) * fix-5181 Padding to max sequence length while truncation to another length was wrong on slow tokenizers * clean up and fix #5155 * fix XLM test * Fix tests for Transfo-XL * logging only above WARNING in tests * switch slow tokenizers tests in @slow * fix Marian truncation tokenization test * style and quality * make the test a lot faster by limiting the sequence length used in tests
-
- 24 Feb, 2020 1 commit
-
-
Lysandre Debut authored
-
- 19 Feb, 2020 1 commit
-
-
Funtowicz Morgan authored
* Implemented fast version of tokenizers Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Bumped tokenizers version requirements to latest 0.2.1 Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Added matching tests Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Matching OpenAI GPT tokenization ! Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Matching GPT2 on tokenizers Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Expose add_prefix_space as constructor parameter for GPT2 Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Matching Roberta tokenization ! Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Removed fast implementation of CTRL. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Binding TransformerXL tokenizers to Rust. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Updating tests accordingly. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Added tokenizers as top-level modules. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Black & isort. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Rename LookupTable to WordLevel to match Rust side. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Black. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Use "fast" suffix instead of "ru" for rust tokenizers implementations. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Introduce tokenize() method on fast tokenizers. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * encode_plus dispatches to batch_encode_plus Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * batch_encode_plus now dispatches to encode if there is only one input element. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Bind all the encode_plus parameter to the forwarded batch_encode_plus call. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Bump tokenizers dependency to 0.3.0 Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Formatting. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Fix tokenization_auto with support for new (python, fast) mapping schema. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Give correct fixtures path in test_tokenization_fast.py for the CLI. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Expose max_len_ properties on BertTokenizerFast Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Move max_len_ properties to PreTrainedTokenizerFast and override in specific subclasses. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * _convert_encoding should keep the batch axis tensor if only one sample in the batch. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Add warning message for RobertaTokenizerFast if used for MLM. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Added use_fast (bool) parameter on AutoTokenizer.from_pretrained(). This makes it easy to enable/disable Rust-based tokenizer instantiation. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Let tokenizers handle all the truncation and padding stuff. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Allow to provide tokenizer arguments during pipeline creation. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Update test_fill_mask pipeline to not use fast tokenizers. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Fix too many parameters for convert_encoding. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * When enabling padding, max_length should be set to None. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Avoid returning nested tensors of length 1 when calling encode_plus Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Ensure output is padded when return_tensor is not None. Tensor creation requires the initial list input to be of the exact same size. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Disable transfoxl unittest if pytorch is not available (required to load the model) Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * encode_plus should not remove the leading batch axis if return_tensor is set Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Temporarily disable fast tokenizers on QA pipelines. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Fix formatting issues. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Update tokenizers to 0.4.0 * Update style * Enable truncation + stride unit test on fast tokenizers. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Add unittest ensuring special_tokens set match between Python and Rust. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Ensure special_tokens are correctly set during construction. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Give more warning feedback to the user in case of padding without pad_token. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * quality & format. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Added possibility to add a single token as str Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Added unittest for add_tokens and add_special_tokens on fast tokenizers. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Fix rebase mismatch on pipelines qa default model. QA requires cased input while the tokenizers would be uncased. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Addressing review comment: Using offset mapping relative to the original string + unittest. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Addressing review comment: save_vocabulary requires folder and file name Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Addressing review comment: Simplify import for Bert. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Addressing review comment: truncate_and_pad disables padding according to the same heuristic as the one enabling padding. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Addressing review comment: Remove private member access in tokenize() Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Addressing review comment: Bump tokenizers dependency to 0.4.2 Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * format & quality. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Addressing review comment: Use named arguments when applicable. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Addressing review comment: Add Github link to Roberta/GPT2 space issue on masked input. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Addressing review comment: Move max_len_single_sentence / max_len_sentences_pair to PreTrainedTokenizerFast + tests. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Addressing review comment: Relax type checking to include tuple and list object. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Addressing review comment: Document the truncate_and_pad manager behavior. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Raise an exception if return_offsets_mapping is not available with the current tokenizer. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Ensure padding is set on the tokenizers before setting any padding strategy + unittest. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * On pytorch we need to stack tensor to get proper new axis. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Generalize tests to different framework removing hard written return_tensors="..." Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Bump tokenizer dependency for num_special_tokens_to_add Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Overflowing tokens in batch_encode_plus are now stacked over the batch axis. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Improved error message for padding strategy without pad token. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Bumping tokenizers dependency to 0.5.0 for release. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Optimizing convert_encoding around 4x improvement. 🚀 Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * expose pad_to_max_length in encode_plus to avoid duplicating the parameters in kwargs Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Generate a proper overflow_to_sampling_mapping when return_overflowing_tokens is True. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Fix unittests for overflow_to_sampling_mapping not being returned as tensor. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Format & quality. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Remove perfect alignment constraint for Roberta (allowing 1% difference max) Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Triggering final CI Co-authored-by:
MOI Anthony <xn1t0x@gmail.com>
-
- 31 Jan, 2020 1 commit
-
-
Lysandre authored
cc @julien-c
-
- 16 Jan, 2020 2 commits
-
-
Julien Chaumond authored
-
Julien Chaumond authored
-
- 14 Jan, 2020 4 commits
-
-
Julien Chaumond authored
-
Julien Chaumond authored
-
Julien Chaumond authored
-
Julien Chaumond authored
-
- 11 Jan, 2020 3 commits
-
-
Julien Chaumond authored
-
Julien Chaumond authored
-
Julien Chaumond authored
-
- 06 Jan, 2020 2 commits
-
-
alberduris authored
-
alberduris authored
-
- 22 Dec, 2019 3 commits
-
-
Aymeric Augustin authored
-
Aymeric Augustin authored
This construct isn't used anymore these days. Running python tests/test_foo.py puts the tests/ directory on PYTHONPATH, which isn't representative of how we run tests. Use python -m unittest tests/test_foo.py instead.
-
Aymeric Augustin authored
-