Commits · b0f05e0c4ced7991fef989a817b05051992bc415 · chenpangpang / transformers

09 Oct, 2020 1 commit
- [pegasus] Faster tokenizer tests (#7672) · b0f05e0c
  Stas Bekman authored Oct 09, 2020
  
  b0f05e0c
08 Oct, 2020 1 commit

Adding Fast tokenizers for SentencePiece based tokenizers - Breaking: remove... · 9aeacb58

Thomas Wolf authored Oct 08, 2020


Adding Fast tokenizers for SentencePiece based tokenizers - Breaking: remove Transfo-XL fast tokenizer (#7141)

* [WIP] SP tokenizers

* fixing tests for T5

* WIP tokenizers

* serialization

* update T5

* WIP T5 tokenization

* slow to fast conversion script

* Refactoring to move tokenzier implementations inside transformers

* Adding gpt - refactoring - quality

* WIP adding several tokenizers to the fast world

* WIP Roberta - moving implementations

* update to dev4 switch file loading to in-memory loading

* Updating and fixing

* advancing on the tokenizers - updating do_lower_case

* style and quality

* moving forward with tokenizers conversion and tests

* MBart, T5

* dumping the fast version of transformer XL

* Adding to autotokenizers + style/quality

* update init and space_between_special_tokens

* style and quality

* bump up tokenizers version

* add protobuf

* fix pickle Bert JP with Mecab

* fix newly added tokenizers

* style and quality

* fix bert japanese

* fix funnel

* limite tokenizer warning to one occurence

* clean up file

* fix new tokenizers

* fast tokenizers deep tests

* WIP adding all the special fast tests on the new fast tokenizers

* quick fix

* adding more fast tokenizers in the fast tests

* all tokenizers in fast version tested

* Adding BertGenerationFast

* bump up setup.py for CI

* remove BertGenerationFast (too early)

* bump up tokenizers version

* Clean old docstrings

* Typo

* Update following Lysandre comments
Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>

9aeacb58

22 Sep, 2020 1 commit
- is_pretokenized -> is_split_into_words (#7236) · 21ca1480
  Sylvain Gugger authored Sep 22, 2020
```
* is_pretokenized -> is_split_into_words

* Fix tests
```
  21ca1480
15 Sep, 2020 1 commit
- [QOL] add signature for prepare_seq2seq_batch (#7108) · 9e89390c
  Sam Shleifer authored Sep 14, 2020
  
  9e89390c
26 Aug, 2020 1 commit

Centralize logging (#6434) · 77abd1e7

Lysandre Debut authored Aug 26, 2020



* Logging

* Style

* hf_logging > utils.logging

* Address @thomwolf's comments

* Update test

* Update src/transformers/benchmark/benchmark_utils.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Revert bad change
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

77abd1e7

25 Aug, 2020 1 commit
- Add typing.overload for convert_ids_tokens (#6637) · 841f0715
  Yohei Tamura authored Aug 25, 2020
```
* add overload for type checker

* black
```
  841f0715
14 Aug, 2020 1 commit
- Sort unique_no_split_tokens to make it deterministic (#6461) · 9a8c168f
  Quentin Lhoest authored Aug 14, 2020
```
* change unique_no_split_tokens's type to set

* use sorted list instead of set

* style
```
  9a8c168f
30 Jul, 2020 1 commit

Doc tokenizer (#6110) · f3065abd

Sylvain Gugger authored Jul 30, 2020



* Start doc tokenizers

* Tokenizer documentation

* Start doc tokenizers

* Tokenizer documentation

* Formatting after rebase

* Formatting after merge

* Update docs/source/main_classes/tokenizer.rst
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Address comment

* Update src/transformers/tokenization_utils_base.py
Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>

* Address Thom's comments
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>

f3065abd

06 Jul, 2020 2 commits
- Fix #5544 (#5551) · 7833b21a
  Sylvain Gugger authored Jul 06, 2020
  
  7833b21a
- Fix the tokenization warning noted in #5505 (#5550) · c4734840
  Thomas Wolf authored Jul 06, 2020
```
* fix warning

* style and quality
```
  c4734840
03 Jul, 2020 1 commit

Exposing prepare_for_model for both slow & fast tokenizers (#5479) · 17ade127

Lysandre Debut authored Jul 03, 2020



* Exposing prepare_for_model for both slow & fast tokenizers

* Update method signature

* The traditional style commit

* Hide the warnings behind the verbose flag

* update default truncation strategy and prepare_for_model

* fix tests and prepare_for_models methods
Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>

17ade127

26 Jun, 2020 2 commits

[tokenizers] Updates data processors, docstring, examples and model cards to the new API (#5308) · 601d4d69

Thomas Wolf authored Jun 26, 2020

* remove references to old API in docstring - update data processors

* style

* fix tests - better type checking error messages

* better type checking

* include awesome fix by @LysandreJik for #5310

* updated doc and examples

601d4d69

Add pad_to_multiple_of on tokenizers (reimport) (#5054) · 135791e8

Funtowicz Morgan authored Jun 26, 2020



* Add new parameter `pad_to_multiple_of` on tokenizers.

* unittest for pad_to_multiple_of

* Add .name when logging enum.

* Fix missing .items() on dict in tests.

* Add special check + warning if the tokenizer doesn't have proper pad_token.

* Use the correct logger format specifier.

* Ensure tokenizer with no pad_token do not modify the underlying padding strategy.

* Skip test if tokenizer doesn't have pad_token

* Fix RobertaTokenizer on empty input

* Format.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* fix and updating to simpler API
Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>

135791e8

25 Jun, 2020 1 commit

[Tokenization] Fix #5181 - make #5155 more explicit - move back the default... · 27cf1d97

Thomas Wolf authored Jun 25, 2020

[Tokenization] Fix #5181 - make #5155 more explicit - move back the default logging level in tests to WARNING (#5252)

* fix-5181

Padding to max sequence length while truncation to another length was wrong on slow tokenizers

* clean up and fix #5155

* fix XLM test

* Fix tests for Transfo-XL

* logging only above WARNING in tests

* switch slow tokenizers tests in @slow

* fix Marian truncation tokenization test

* style and quality

* make the test a lot faster by limiting the sequence length used in tests

27cf1d97

24 Jun, 2020 1 commit

Add more tests on tokenizers serialization - fix bugs (#5056) · 7ac91107

Thomas Wolf authored Jun 24, 2020

* update tests for fast tokenizers + fix small bug in saving/loading

* better tests on serialization

* fixing serialization

* comment cleanup

7ac91107

23 Jun, 2020 2 commits

More clear error message in the use-case of #5169 (#5184) · b28b5371
Thomas Wolf authored Jun 23, 2020

b28b5371

Tokenizers API developments (#5103) · 11fdde02

Thomas Wolf authored Jun 23, 2020



* Add return lengths

* make pad a bit more flexible so it can be used as collate_fn

* check all kwargs sent to encoding method are known

* fixing kwargs in encodings

* New AddedToken class in python

This class let you specify specifique tokenization behaviors for some special tokens. Used in particular for GPT2 and Roberta, to control how white spaces are stripped around special tokens.

* style and quality

* switched to hugginface tokenizers library for AddedTokens

* up to tokenizer 0.8.0-rc3 - update API to use AddedToken state

* style and quality

* do not raise an error on additional or unused kwargs for tokenize() but only a warning

* transfo-xl pretrained model requires torch

* Update src/transformers/tokenization_utils.py
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

11fdde02

22 Jun, 2020 1 commit

[tokenizers] Fix #5081 and improve backward compatibility (#5125) · ebc36108

Thomas Wolf authored Jun 22, 2020

* fix #5081 and improve backward compatibility (slightly)

* add nlp to setup.cfg - style and quality

* align default to previous default

* remove test that doesn't generalize

ebc36108

15 Jun, 2020 1 commit

[HUGE] Refactoring tokenizers backend - padding - truncation - pre-tokenized... · 36434220

Anthony MOI authored Jun 15, 2020


[HUGE] Refactoring tokenizers backend - padding - truncation - pre-tokenized pipeline - fast tokenizers - tests (#4510)

* Use tokenizers pre-tokenized pipeline

* failing pretrokenized test

* Fix is_pretokenized in python

* add pretokenized tests

* style and quality

* better tests for batched pretokenized inputs

* tokenizers clean up - new padding_strategy - split the files

* [HUGE] refactoring tokenizers - padding - truncation - tests

* style and quality

* bump up requied tokenizers version to 0.8.0-rc1

* switched padding/truncation API - simpler better backward compat

* updating tests for custom tokenizers

* style and quality - tests on pad

* fix QA pipeline

* fix backward compatibility for max_length only

* style and quality

* Various cleans up - add verbose

* fix tests

* update docstrings

* Fix tests

* Docs reformatted

* __call__ method documented
Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>

36434220

09 Jun, 2020 1 commit
- Fix the __getattr__ method in BatchEncoding (#4772) · 9f5d5a53
  Julien Plu authored Jun 09, 2020
  
  9f5d5a53
04 Jun, 2020 2 commits

Don't access pad_token_id if there is no pad_token (#4773) · 12d0eb5f
Sylvain Gugger authored Jun 04, 2020

12d0eb5f

Introduce a new tensor type for return_tensors on tokenizer for NumPy (#4585) · 5bf9afbf

Funtowicz Morgan authored Jun 04, 2020

* Refactor tensor creation in tokenizers.

* Make sure to convert string to TensorType

* Refactor convert_to_tensors_

* Introduce numpy tensor creation

* Format

* Add unittest for TensorType creation from str

* sorting imports

* Added unittests for numpy tensor conversion.

* Do not use in-place version for squeeze as numpy doesn't provide such feature.

* Added extra parameter prepend_batch_axis: bool on prepare_for_model.

* Ensure test_np_encode_plus_sent_to_model is not executed if encoder/decoder model.

* style.

* numpy tests require_torch for now while flax not merged.

* Hopefully will make flake8 happy.

* One more time 🎶

5bf9afbf

03 Jun, 2020 1 commit
- Update encode documentation (#4751) · 2e4de762
  Lysandre Debut authored Jun 03, 2020
  
  2e4de762
02 Jun, 2020 2 commits

Kill model archive maps (#4636) · d4c2cb40

Julien Chaumond authored Jun 02, 2020

* Kill model archive maps

* Fixup

* Also kill model_archive_map for MaskedBertPreTrainedModel

* Unhook config_archive_map

* Tokenizers: align with model id changes

* make style && make quality

* Fix CI

d4c2cb40

Override get_vocab for fast tokenizer. (#4717) · f6d5046a
Funtowicz Morgan authored Jun 02, 2020

f6d5046a

28 May, 2020 1 commit
- Fix add_special_tokens on fast tokenizers (#4531) · 5e737018
  Anthony MOI authored May 28, 2020
  
  5e737018
25 May, 2020 1 commit
- fixing tokenization of extra_id symbols in T5Tokenizer. Related to issue 4021 (#4353) · 3dea40b8
  Elman Mansimov authored May 25, 2020
  
  3dea40b8
22 May, 2020 1 commit

Warn the user about max_len being on the path to be deprecated. (#4528) · 996f393a

Funtowicz Morgan authored May 22, 2020

* Warn the user about max_len being on the path to be deprecated.

* Ensure better backward compatibility when max_len is provided to a tokenizer.

* Make sure to override the parameter and not the actual instance value.

* Format & quality

996f393a

19 May, 2020 1 commit
- [cleanup] test_tokenization_common.py (#4390) · 07dd7c2f
  Sam Shleifer authored May 19, 2020
  
  07dd7c2f
14 May, 2020 1 commit
- Tokenizer.batch_decode convenience method (#4159) · 9535bf19
  Sam Shleifer authored May 14, 2020
  
  9535bf19
13 May, 2020 1 commit

Fix for #3865. PretrainedTokenizer mapped " do not" into " don't" when... · 1e51bb71

Denis authored May 13, 2020

Fix for #3865. PretrainedTokenizer mapped " do not" into " don't" when .decode(...) is called. Removed the " do not" --> " don't" mapping from clean_up_tokenization(...). (#4024)

1e51bb71

12 May, 2020 1 commit

Allow BatchEncoding to be initialized empty. (#4316) · 7d7fe499

Funtowicz Morgan authored May 12, 2020

* Allow BatchEncoding to be initialized empty.

This is required by recent changes introduced in TF 2.2.

* Attempt to unpin Tensorflow to 2.2 with the previous commit.

7d7fe499

07 May, 2020 1 commit
- Ensure fast tokenizer can construct tensor without pad token if only one... · 026097b9
  Funtowicz Morgan authored May 07, 2020
```
Ensure fast tokenizer can construct tensor without pad token if only one sample is provided. (#4201)
```
  026097b9
29 Apr, 2020 1 commit

CDN urls (#4030) · 455c6390

Julien Chaumond authored Apr 28, 2020

* [file_utils] use_cdn + documentation

* Move to cdn. urls for weights

* [urls] Hotfix for bert-base-japanese

455c6390

28 Apr, 2020 2 commits

Allow a more backward compatible behavior of max_len_single_sentence and... · 8ba4c588

Thomas Wolf authored Apr 29, 2020

Allow a more backward compatible behavior of max_len_single_sentence and max_len_sentences_pair (#3994)

* Allow a more backward compatible behavior of max_len_single_sentence and max_len_sentences_pair and

* The style and quality are now top-notch

8ba4c588

MarianMTModel.from_pretrained('Helsinki-NLP/opus-marian-en-de') (#3908) · 847e7f33
Sam Shleifer authored Apr 28, 2020
```
Co-Authored-By: Stefan Schweter <stefan@schweter.it>
```
847e7f33

18 Apr, 2020 1 commit

Cleanup fast tokenizers integration (#3706) · 827d6d6e

Thomas Wolf authored Apr 18, 2020



* First pass on utility classes and python tokenizers

* finishing cleanup pass

* style and quality

* Fix tests

* Updating following @mfuntowicz comment

* style and quality

* Fix Roberta

* fix batch_size/seq_length inBatchEncoding

* add alignement methods + tests

* Fix OpenAI and Transfo-XL tokenizers

* adding trim_offsets=True default for GPT2 et RoBERTa

* style and quality

* fix tests

* add_prefix_space in roberta

* bump up tokenizers to rc7

* style

* unfortunately tensorfow does like these - removing shape/seq_len for now

* Update src/transformers/tokenization_utils.py
Co-Authored-By: Stefan Schweter <stefan@schweter.it>

* Adding doc and docstrings

* making flake8 happy
Co-authored-by: Stefan Schweter <stefan@schweter.it>

827d6d6e

16 Apr, 2020 1 commit
- [PretrainedTokenizer] Factor out tensor conversion method (#3777) · 16469fed
  Sam Shleifer authored Apr 16, 2020
  
  16469fed
07 Apr, 2020 1 commit
- [Tokenization] fix edge case for bert tokenization (#3517) · b0ad0695
  Patrick von Platen authored Apr 07, 2020
```
* fix egde gase for bert tokenization

* add Lysandres comments for improvement

* use new is_pretokenized_flag
```
  b0ad0695
06 Apr, 2020 1 commit

Tokenizers v3.0.0 (#3185) · 96ab75b8

Funtowicz Morgan authored Apr 06, 2020

* Renamed num_added_tokens to num_special_tokens_to_add
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Cherry-Pick: Partially fix space only input without special tokens added to the output #3091
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added property is_fast on PretrainedTokenizer and PretrainedTokenizerFast
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Make fast tokenizers unittests work on Windows.

* Entirely refactored unittest for tokenizers fast.

* Remove ABC class for CommonFastTokenizerTest

* Added embeded_special_tokens tests from allenai @dirkgr

* Make embeded_special_tokens tests from allenai more generic

* Uniformize vocab_size as a property for both Fast and normal tokenizers

* Move special tokens handling out of PretrainedTokenizer (SpecialTokensMixin)

* Ensure providing None input raise the same ValueError than Python tokenizer + tests.

* Fix invalid input for assert_padding when testing batch_encode_plus

* Move add_special_tokens from constructor to tokenize/encode/[batch_]encode_plus methods parameter.

* Ensure tokenize() correctly forward add_special_tokens to rust.

* Adding None checking on top on encode / encode_batch for TransfoXLTokenizerFast.
Avoid stripping on None values.

* unittests ensure tokenize() also throws a ValueError if provided None

* Added add_special_tokens unittest for all supported models.

* Style

* Make sure TransfoXL test run only if PyTorch is provided.

* Split up tokenizers tests for each model type.

* Fix invalid unittest with new tokenizers API.

* Filter out Roberta openai detector models from unittests.

* Introduce BatchEncoding on fast tokenizers path.

This new structure exposes all the mappings retrieved from Rust.
It also keeps the current behavior with model forward.

* Introduce BatchEncoding on slow tokenizers path.

Backward compatibility.

* Improve error message on BatchEncoding for slow path

* Make add_prefix_space True by default on Roberta fast to match Python in majority of cases.

* Style and format.

* Added typing on all methods for PretrainedTokenizerFast

* Style and format

* Added path for feeding pretokenized (List[str]) input to PretrainedTokenizerFast.

* Style and format

* encode_plus now supports pretokenized inputs.

* Remove user warning about add_special_tokens when working on pretokenized inputs.

* Always go through the post processor.

* Added support for pretokenized input pairs on encode_plus

* Added is_pretokenized flag on encode_plus for clarity and improved error message on input TypeError.

* Added pretokenized inputs support on batch_encode_plus

* Update BatchEncoding methods name to match Encoding.

* Bump setup.py tokenizers dependency to 0.7.0rc1

* Remove unused parameters in BertTokenizerFast

* Make sure Roberta returns token_type_ids for unittests.

* Added missing typings

* Update add_tokens prototype to match tokenizers side and allow AddedToken

* Bumping tokenizers to 0.7.0rc2

* Added documentation for BatchEncoding

* Added (unused) is_pretokenized parameter on PreTrainedTokenizer encode_plus/batch_encode_plus methods.

* Added higher-level typing for tokenize / encode_plus / batch_encode_plus.

* Fix unittests failing because add_special_tokens was defined as a constructor parameter on Rust Tokenizers.

* Fix text-classification pipeline using the wrong tokenizer

* Make pipelines works with BatchEncoding

* Turn off add_special_tokens on tokenize by default.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Remove add_prefix_space from tokenize call in unittest.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Style and quality
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Correct message for batch_encode_plus none input exception.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Fix invalid list comprehension for offset_mapping overriding content every iteration.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* TransfoXL uses Strip normalizer.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Bump tokenizers dependency to 0.7.0rc3
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Support AddedTokens for special_tokens and use left stripping on mask for Roberta.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* SpecilaTokenMixin can use slots to faster access to underlying attributes.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Remove update_special_tokens from fast tokenizers.

* Ensure TransfoXL unittests are run only when torch is available.

* Style.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Style

* Style 🙏🙏

* Remove slots on SpecialTokensMixin, need deep dive into pickle protocol.

* Remove Roberta warning on __init__.

* Move documentation to Google style.
Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>

96ab75b8