- 15 Sep, 2020 1 commit
  Sam Shleifer authored
- 14 Sep, 2020 1 commit
  Stas Bekman authored
  * fix deprecation warnings
  * remove tests/test_tokenization_common.py's test_padding_to_max_length
  * revert test_padding_to_max_length
- 09 Sep, 2020 1 commit
  Lysandre Debut authored
  Batch encode plus with overflowing tokens fails when there are no overflowing tokens for a sequence (#6677)
  * Patch and test
  * Fix tests
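The failing case above is a batch where only some sequences actually overflow. A minimal sketch of that call pattern, assuming an arbitrary checkpoint and budget (the exact returned keys depend on whether a fast tokenizer backs the call):

```python
from transformers import AutoTokenizer

# arbitrary checkpoint; any fast tokenizer illustrates the pattern
tok = AutoTokenizer.from_pretrained("bert-base-uncased")

batch = [
    "a fairly long sentence that does not fit in the budget and therefore overflows",
    "short",  # produces no overflowing tokens
]
enc = tok(
    batch,
    truncation=True,
    max_length=8,
    return_overflowing_tokens=True,
)
# with a fast tokenizer, overflowing pieces become extra rows and a mapping
# records which original sample each row came from
print(enc.keys())
```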
- 28 Aug, 2020 1 commit
  Sam Shleifer authored
  * broken test
  * batch parity
  * tests pass
  * boom boom
  * boom boom
  * split out bart tokenizer tests
  * fix tests
  * boom boom
  * Fixed dataset bug
  * Fix marian
  * Undo extra
  * Get marian working
  * Fix t5 tok tests
  * Test passing
  * Cleanup
  * better assert msg
  * require torch
  * Fix mbart tests
  * undo extra decoder_attn_mask change
  * Fix import
  * pegasus tokenizer can ignore src_lang kwargs
  * unused kwarg test cov
  * boom boom
  * add todo for pegasus issue
  * cover one word translation edge case
  * Cleanup
  * doc
- 26 Aug, 2020 1 commit
  Lysandre authored
- 24 Aug, 2020 1 commit
  Sylvain Gugger authored
  * Run new isort
  * More changes
  * Update CI, CONTRIBUTING and benchmarks
- 11 Aug, 2020 2 commits
  Sam Shleifer authored
  Junyuan Zheng authored
  * fix tokenizer saving and loading bugs when adding AddedToken to additional special tokens
  * Add tokenizer test
  * Style
  * Style 2
  Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
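The bug fixed above concerned round-tripping `AddedToken` entries registered as additional special tokens. A minimal sketch of the pattern being exercised (the checkpoint name, marker string and save path are placeholders):

```python
from tokenizers import AddedToken
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

# register an AddedToken (not a plain str) as an additional special token
tok.add_special_tokens(
    {"additional_special_tokens": [AddedToken("<new_marker>", lstrip=True)]}
)

# the saving/loading bug was about preserving such tokens across this round trip
tok.save_pretrained("./tok-with-added-token")
reloaded = AutoTokenizer.from_pretrained("./tok-with-added-token")
print(reloaded.additional_special_tokens)
```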
- 04 Aug, 2020 1 commit
  Sam Shleifer authored
- 07 Jul, 2020 1 commit
  Sam Shleifer authored
  * improve unit tests for finetuning, especially w.r.t. testing frozen parameters
  * fix freeze_embeds for T5
  * add streamlit to setup.cfg
- 06 Jul, 2020 1 commit
  Anthony MOI authored
  * BertTokenizerFast - Do not specify strip_accents by default
  * Bump tokenizers to new version
  * Add test for AddedToken serialization
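Since `strip_accents` is no longer pinned by default, it follows the checkpoint's own normalizer unless set explicitly. A small illustrative sketch (the checkpoint name and sample text are arbitrary):

```python
from transformers import BertTokenizerFast

# leave accent handling to the checkpoint's normalizer config...
tok_default = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")

# ...or force a specific behaviour when you need one
tok_keep_accents = BertTokenizerFast.from_pretrained(
    "bert-base-multilingual-cased", strip_accents=False
)
print(tok_keep_accents.tokenize("déjà vu"))
```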
- 03 Jul, 2020 1 commit
  Lysandre Debut authored
  * Exposing prepare_for_model for both slow & fast tokenizers
  * Update method signature
  * The traditional style commit
  * Hide the warnings behind the verbose flag
  * update default truncation strategy and prepare_for_model
  * fix tests and prepare_for_models methods
  Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
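With `prepare_for_model` exposed on both slow and fast tokenizers, pre-computed id sequences can be combined, truncated and decorated with special tokens directly. A minimal sketch (checkpoint and length budget are arbitrary):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

ids = tok.encode("How are you?", add_special_tokens=False)
pair_ids = tok.encode("I am fine, thanks.", add_special_tokens=False)

# builds the usual model inputs from raw id lists, applying the chosen
# truncation strategy and adding special tokens
encoded = tok.prepare_for_model(ids, pair_ids, truncation=True, max_length=16)
print(encoded["input_ids"])
```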
- 01 Jul, 2020 1 commit
  Sam Shleifer authored
- 26 Jun, 2020 1 commit
  Funtowicz Morgan authored
  * Add new parameter `pad_to_multiple_of` on tokenizers.
  * unittest for pad_to_multiple_of
  * Add .name when logging enum.
  * Fix missing .items() on dict in tests.
  * Add special check + warning if the tokenizer doesn't have a proper pad_token.
  * Use the correct logger format specifier.
  * Ensure tokenizers with no pad_token do not modify the underlying padding strategy.
  * Skip test if tokenizer doesn't have pad_token
  * Fix RobertaTokenizer on empty input
  * Format. Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>
  * fix and updating to simpler API
  Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
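`pad_to_multiple_of` rounds every padded length up to the next multiple of a given value, which suits hardware that prefers fixed-size multiples. A minimal sketch, assuming an arbitrary checkpoint:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token  # GPT-2 ships without a pad token (see the warning added above)

batch = tok(
    ["short", "a noticeably longer example sentence"],
    padding=True,
    pad_to_multiple_of=8,
)
print([len(ids) for ids in batch["input_ids"]])  # every length is a multiple of 8
```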
- 25 Jun, 2020 1 commit
  Thomas Wolf authored
  [Tokenization] Fix #5181 - make #5155 more explicit - move back the default logging level in tests to WARNING (#5252)
  * fix #5181: padding to max sequence length while truncating to another length was wrong on slow tokenizers
  * clean up and fix #5155
  * fix XLM test
  * Fix tests for Transfo-XL
  * logging only above WARNING in tests
  * switch slow tokenizers tests to @slow
  * fix Marian truncation tokenization test
  * style and quality
  * make the test a lot faster by limiting the sequence length used in tests
- 24 Jun, 2020 1 commit
  Thomas Wolf authored
  * update tests for fast tokenizers + fix small bug in saving/loading
  * better tests on serialization
  * fixing serialization
  * comment cleanup
- 23 Jun, 2020 1 commit
  Thomas Wolf authored
  * Add return lengths
  * make pad a bit more flexible so it can be used as collate_fn
  * check all kwargs sent to encoding method are known
  * fixing kwargs in encodings
  * New AddedToken class in python. This class lets you specify specific tokenization behaviors for some special tokens. Used in particular for GPT2 and Roberta, to control how white spaces are stripped around special tokens.
  * style and quality
  * switched to huggingface tokenizers library for AddedTokens
  * up to tokenizers 0.8.0-rc3 - update API to use AddedToken state
  * style and quality
  * do not raise an error on additional or unused kwargs for tokenize() but only a warning
  * transfo-xl pretrained model requires torch
  * Update src/transformers/tokenization_utils.py
  Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
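One point above is that `tokenizer.pad` is now flexible enough to act as a DataLoader `collate_fn`, so each batch can be padded dynamically to its own longest member. A sketch under that assumption (requires torch; the checkpoint and texts are arbitrary):

```python
from torch.utils.data import DataLoader
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-base")

texts = ["first example", "a somewhat longer second example", "third"]
features = [tok(t) for t in texts]  # un-padded encodings of varying length

# tokenizer.pad pads a list of encodings to the longest one in each batch
loader = DataLoader(
    features,
    batch_size=2,
    collate_fn=lambda batch: tok.pad(batch, return_tensors="pt"),
)
for batch in loader:
    print(batch["input_ids"].shape)
```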
- 22 Jun, 2020 1 commit
  Thomas Wolf authored
  * fix #5081 and improve backward compatibility (slightly)
  * add nlp to setup.cfg - style and quality
  * align default to previous default
  * remove test that doesn't generalize
- 15 Jun, 2020 1 commit
  Anthony MOI authored
  [HUGE] Refactoring tokenizers backend - padding - truncation - pre-tokenized pipeline - fast tokenizers - tests (#4510)
  * Use tokenizers pre-tokenized pipeline
  * failing pretokenized test
  * Fix is_pretokenized in python
  * add pretokenized tests
  * style and quality
  * better tests for batched pretokenized inputs
  * tokenizers clean up - new padding_strategy - split the files
  * [HUGE] refactoring tokenizers - padding - truncation - tests
  * style and quality
  * bump up required tokenizers version to 0.8.0-rc1
  * switched padding/truncation API - simpler better backward compat
  * updating tests for custom tokenizers
  * style and quality - tests on pad
  * fix QA pipeline
  * fix backward compatibility for max_length only
  * style and quality
  * Various cleanups - add verbose
  * fix tests
  * update docstrings
  * Fix tests
  * Docs reformatted
  * __call__ method documented
  Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
  Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
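The switched padding/truncation API referenced above is the one where padding and truncation are passed as explicit strategies to a unified tokenizer call. A brief sketch of the resulting usage (checkpoint and lengths are arbitrary; `is_pretokenized` was the flag name at the time and was later renamed `is_split_into_words`):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

# unified __call__ with explicit padding/truncation strategies
batch = tok(
    ["a short sentence", "a much longer sentence that will get truncated to the budget"],
    padding="longest",   # or True / "max_length" / False
    truncation=True,
    max_length=16,
)

# pre-tokenized pipeline: the input is already split into words
pre = tok([["a", "pre", "tokenized", "sentence"]], is_pretokenized=True)
print(len(batch["input_ids"]), len(pre["input_ids"]))
```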
- 04 Jun, 2020 1 commit
  Funtowicz Morgan authored
  * Refactor tensor creation in tokenizers.
  * Make sure to convert string to TensorType
  * Refactor convert_to_tensors_
  * Introduce numpy tensor creation
  * Format
  * Add unittest for TensorType creation from str
  * sorting imports
  * Added unittests for numpy tensor conversion.
  * Do not use in-place version for squeeze as numpy doesn't provide such a feature.
  * Added extra parameter prepend_batch_axis: bool on prepare_for_model.
  * Ensure test_np_encode_plus_sent_to_model is not executed if encoder/decoder model.
  * style.
  * numpy tests require_torch for now while flax not merged.
  * Hopefully will make flake8 happy.
  * One more time
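The `TensorType` conversion added here means `return_tensors` accepts either a plain string or an enum member, including the new numpy path. A small sketch (checkpoint is arbitrary; this assumes `TensorType` is importable from the top-level package, as in recent releases):

```python
from transformers import AutoTokenizer, TensorType

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

enc_str = tok("Hello world", return_tensors="np")              # string converted to TensorType
enc_enum = tok("Hello world", return_tensors=TensorType.NUMPY)  # explicit enum member

print(type(enc_str["input_ids"]))  # numpy.ndarray
```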
- 19 May, 2020 1 commit
  Sam Shleifer authored
- 14 May, 2020 1 commit
  Julien Chaumond authored
  * Fix: unpin flake8 and fix cs errors
  * Ok we still need to quote those
- 01 May, 2020 1 commit
  Julien Chaumond authored
- 09 Apr, 2020 1 commit
  LysandreJik authored
  cc @julien-c
- 08 Apr, 2020 1 commit
  Lysandre Debut authored
  * Updating modeling tf files; adding tests
  * Merge `encode_plus` and `batch_encode_plus`
- 06 Apr, 2020 1 commit
  Funtowicz Morgan authored
  * Renamed num_added_tokens to num_special_tokens_to_add Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
  * Cherry-Pick: Partially fix space only input without special tokens added to the output #3091 Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
  * Added property is_fast on PretrainedTokenizer and PretrainedTokenizerFast Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
  * Make fast tokenizers unittests work on Windows.
  * Entirely refactored unittest for tokenizers fast.
  * Remove ABC class for CommonFastTokenizerTest
  * Added embedded_special_tokens tests from allenai @dirkgr
  * Make embedded_special_tokens tests from allenai more generic
  * Uniformize vocab_size as a property for both Fast and normal tokenizers
  * Move special tokens handling out of PretrainedTokenizer (SpecialTokensMixin)
  * Ensure providing None input raises the same ValueError as the Python tokenizer + tests.
  * Fix invalid input for assert_padding when testing batch_encode_plus
  * Move add_special_tokens from constructor to tokenize/encode/[batch_]encode_plus methods parameter.
  * Ensure tokenize() correctly forwards add_special_tokens to rust.
  * Adding None checking on top of encode / encode_batch for TransfoXLTokenizerFast. Avoid stripping on None values.
  * unittests ensure tokenize() also throws a ValueError if provided None
  * Added add_special_tokens unittest for all supported models.
  * Style
  * Make sure TransfoXL tests run only if PyTorch is provided.
  * Split up tokenizers tests for each model type.
  * Fix invalid unittest with new tokenizers API.
  * Filter out Roberta openai detector models from unittests.
  * Introduce BatchEncoding on fast tokenizers path. This new structure exposes all the mappings retrieved from Rust. It also keeps the current behavior with model forward.
  * Introduce BatchEncoding on slow tokenizers path. Backward compatibility.
  * Improve error message on BatchEncoding for slow path
  * Make add_prefix_space True by default on Roberta fast to match Python in majority of cases.
  * Style and format.
  * Added typing on all methods for PretrainedTokenizerFast
  * Style and format
  * Added path for feeding pretokenized (List[str]) input to PretrainedTokenizerFast.
  * Style and format
  * encode_plus now supports pretokenized inputs.
  * Remove user warning about add_special_tokens when working on pretokenized inputs.
  * Always go through the post processor.
  * Added support for pretokenized input pairs on encode_plus
  * Added is_pretokenized flag on encode_plus for clarity and improved error message on input TypeError.
  * Added pretokenized inputs support on batch_encode_plus
  * Update BatchEncoding methods name to match Encoding.
  * Bump setup.py tokenizers dependency to 0.7.0rc1
  * Remove unused parameters in BertTokenizerFast
  * Make sure Roberta returns token_type_ids for unittests.
  * Added missing typings
  * Update add_tokens prototype to match tokenizers side and allow AddedToken
  * Bumping tokenizers to 0.7.0rc2
  * Added documentation for BatchEncoding
  * Added (unused) is_pretokenized parameter on PreTrainedTokenizer encode_plus/batch_encode_plus methods.
  * Added higher-level typing for tokenize / encode_plus / batch_encode_plus.
  * Fix unittests failing because add_special_tokens was defined as a constructor parameter on Rust Tokenizers.
  * Fix text-classification pipeline using the wrong tokenizer
  * Make pipelines work with BatchEncoding
  * Turn off add_special_tokens on tokenize by default. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
  * Remove add_prefix_space from tokenize call in unittest. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
  * Style and quality Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
  * Correct message for batch_encode_plus none input exception. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
  * Fix invalid list comprehension for offset_mapping overriding content every iteration. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
  * TransfoXL uses Strip normalizer. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
  * Bump tokenizers dependency to 0.7.0rc3 Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
  * Support AddedTokens for special_tokens and use left stripping on mask for Roberta. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
  * SpecialTokensMixin can use slots for faster access to underlying attributes. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
  * Remove update_special_tokens from fast tokenizers.
  * Ensure TransfoXL unittests are run only when torch is available.
  * Style. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
  * Style
  * Style 🙏 🙏
  * Remove slots on SpecialTokensMixin, need deep dive into pickle protocol.
  * Remove Roberta warning on __init__.
  * Move documentation to Google style.
  Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>
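Several of the pieces above (the `is_fast` property, the `num_special_tokens_to_add` rename, and `BatchEncoding` exposing the Rust-side mappings) show up directly in everyday use. A hedged sketch (checkpoint is arbitrary; offset mappings are a fast-tokenizer-only feature):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

print(tok.is_fast)                      # property introduced by this refactor
print(tok.num_special_tokens_to_add())  # renamed from num_added_tokens

enc = tok("Hello there!", return_offsets_mapping=True)
print(enc.tokens())            # BatchEncoding surfaces the Rust Encoding mappings
print(enc["offset_mapping"])   # per-token character offsets
```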
- 09 Mar, 2020 1 commit
  Lysandre Debut authored
  * Minimal example
  * Proposal 2
  * Proposal 2 for fast tokenizers
  * Typings
  * Docs
  * Revert "Docs" for easier review. This reverts commit eaf0f97062e809887704a542144c537f769d5223.
  * Remove unnecessary assignments
  * Tests
  * Fix faulty type
  * Remove prints
  * return_outputs -> model_input_names
  * Revert "Revert "Docs" for easier review". This reverts commit 6fdc69408102bf695797f2dfddbb6350c6b9e722.
  * code quality
- 02 Mar, 2020 1 commit
  Patrick von Platen authored
  * force pad_token_id to be set before padding
  * fix tests and forbid padding without having a padding_token_id set
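The behaviour above means tokenizers that ship without a pad token (e.g. GPT-2) must have one assigned before padding is requested. A minimal sketch of the now-required pattern (checkpoint and inputs are arbitrary):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # no pad token defined by default
# tok(["hi", "hello there"], padding=True)   # would fail: padding needs a pad_token_id

tok.pad_token = tok.eos_token                # explicitly pick a padding token first
batch = tok(["hi", "hello there"], padding=True)
print(batch["input_ids"])
```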
- 24 Feb, 2020 1 commit
  Lysandre Debut authored
  * Testing that encode_plus and batch_encode_plus behave the same way. Spoiler alert: they don't
  * Testing rest of arguments in batch_encode_plus
  * Test tensor return in batch_encode_plus
  * Addressing Sam's comments
  * flake8
  * Simplified with `num_added_tokens`
- 20 Feb, 2020 1 commit
  Joe Davison authored
- 13 Feb, 2020 1 commit
  Joe Davison authored
  * Preserve spaces in GPT-2 tokenizers. Preserves spaces after special tokens in GPT-2 and inherited (RoBERTa) tokenizers, enabling correct BPE encoding. Automatically inserts a space in front of the first token in the encode function when adding special tokens.
  * Add tokenization preprocessing method
  * Add framework argument to pipeline factory. Also fixes pipeline test issue. Each test input now treated as a distinct sequence.
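The space handling matters because GPT-2's byte-level BPE is whitespace sensitive: a leading space changes which token a word maps to. A small illustrative sketch (exact token strings depend on the vocabulary):

```python
from transformers import GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")

# byte-level BPE distinguishes a word with and without a preceding space
print(tok.tokenize("world"))   # e.g. ['world']
print(tok.tokenize(" world"))  # e.g. ['Ġworld']  (Ġ marks a leading space)
```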
- 29 Jan, 2020 2 commits
- 06 Jan, 2020 2 commits
  alberduris authored
  alberduris authored
- 24 Dec, 2019 1 commit
  Anthony MOI authored
- 23 Dec, 2019 1 commit
  Aymeric Augustin authored
- 22 Dec, 2019 3 commits
  Aymeric Augustin authored
  On Python 3, `open is io.open`.
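That one-liner is easy to verify, which is why the explicit `io.open` usages could be dropped:

```python
import io

# on Python 3 the built-in open and io.open are the same object
assert open is io.open
```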
  Aymeric Augustin authored
  Aymeric Augustin authored