Commits · 4d04120c6d76ed0ddf6525dd60c1211f4afffb2f · chenpangpang / transformers

06 Aug, 2020 1 commit

Add strip_accents to basic BertTokenizer. (#6280) · d5bc32ce

Philip May authored Aug 06, 2020

* Add strip_accents to basic tokenizer

* Add tests for strip_accents.

* fix style with black

* Fix strip_accents test

* empty commit to trigger CI

* Improved strip_accents check

* Add code quality with is not False
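
The bullets above describe an optional `strip_accents` switch on the basic tokenizer. The following is a minimal sketch of how such a switch could interact with lowercasing. It is not the library's code: the helper names `remove_accents` and `normalize_token` are hypothetical, and the rule that an unset (`None`) value follows the lowercasing setting is an assumption suggested by the "is not False" bullet.

```python
import unicodedata


def remove_accents(text: str) -> str:
    """Drop combining marks after canonical (NFD) decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" (nonspacing mark) covers the combining accents.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def normalize_token(token: str, do_lower_case: bool = True, strip_accents=None) -> str:
    """Hypothetical helper: lowercase and optionally strip accents.

    Assumption: strip_accents=None means "follow do_lower_case",
    while an explicit True/False always wins.
    """
    if do_lower_case:
        token = token.lower()
        if strip_accents is not False:
            token = remove_accents(token)
    elif strip_accents:
        token = remove_accents(token)
    return token


print(normalize_token("Héllo"))                                          # default
print(normalize_token("Héllo", strip_accents=False))                     # keep accents
print(normalize_token("Héllo", do_lower_case=False, strip_accents=True))  # keep case
```

Under this assumption, lowercasing and accent stripping can be controlled separately, which matters for cased languages where accents carry meaning.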


01 Jul, 2020 1 commit
- Move tests/utils.py -> transformers/testing_utils.py (#5350) · 13deb95a
  Sam Shleifer authored Jul 01, 2020
  
15 Jun, 2020 1 commit

[HUGE] Refactoring tokenizers backend - padding - truncation - pre-tokenized... · 36434220

Anthony MOI authored Jun 15, 2020


[HUGE] Refactoring tokenizers backend - padding - truncation - pre-tokenized pipeline - fast tokenizers - tests (#4510)

* Use tokenizers pre-tokenized pipeline

* failing pretokenized test

* Fix is_pretokenized in python

* add pretokenized tests

* style and quality

* better tests for batched pretokenized inputs

* tokenizers clean up - new padding_strategy - split the files

* [HUGE] refactoring tokenizers - padding - truncation - tests

* style and quality

* bump up required tokenizers version to 0.8.0-rc1

* switched padding/truncation API - simpler better backward compat

* updating tests for custom tokenizers

* style and quality - tests on pad

* fix QA pipeline

* fix backward compatibility for max_length only

* style and quality

* Various cleanups - add verbose

* fix tests

* update docstrings

* Fix tests

* Docs reformatted

* __call__ method documented
Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>


19 May, 2020 1 commit
- [cleanup] test_tokenization_common.py (#4390) · 07dd7c2f
  Sam Shleifer authored May 19, 2020
  
06 Apr, 2020 1 commit

Tokenizers v3.0.0 (#3185) · 96ab75b8

Funtowicz Morgan authored Apr 06, 2020

* Renamed num_added_tokens to num_special_tokens_to_add
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Cherry-Pick: Partially fix space only input without special tokens added to the output #3091
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added property is_fast on PretrainedTokenizer and PretrainedTokenizerFast
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Make fast tokenizers unittests work on Windows.

* Entirely refactored unittest for tokenizers fast.

* Remove ABC class for CommonFastTokenizerTest

* Added embeded_special_tokens tests from allenai @dirkgr

* Make embeded_special_tokens tests from allenai more generic

* Uniformize vocab_size as a property for both Fast and normal tokenizers

* Move special tokens handling out of PretrainedTokenizer (SpecialTokensMixin)

* Ensure providing None input raises the same ValueError as the Python tokenizer + tests.

* Fix invalid input for assert_padding when testing batch_encode_plus

* Move add_special_tokens from constructor to tokenize/encode/[batch_]encode_plus methods parameter.

* Ensure tokenize() correctly forwards add_special_tokens to Rust.

* Adding None checking on top of encode / encode_batch for TransfoXLTokenizerFast.
Avoid stripping on None values.

* unittests ensure tokenize() also throws a ValueError if provided None

* Added add_special_tokens unittest for all supported models.

* Style

* Make sure TransfoXL tests run only if PyTorch is provided.

* Split up tokenizers tests for each model type.

* Fix invalid unittest with new tokenizers API.

* Filter out Roberta openai detector models from unittests.

* Introduce BatchEncoding on fast tokenizers path.

This new structure exposes all the mappings retrieved from Rust.
It also keeps the current behavior with model forward.

* Introduce BatchEncoding on slow tokenizers path.

Backward compatibility.

* Improve error message on BatchEncoding for slow path

* Make add_prefix_space True by default on Roberta fast to match Python in majority of cases.

* Style and format.

* Added typing on all methods for PretrainedTokenizerFast

* Style and format

* Added path for feeding pretokenized (List[str]) input to PretrainedTokenizerFast.

* Style and format

* encode_plus now supports pretokenized inputs.

* Remove user warning about add_special_tokens when working on pretokenized inputs.

* Always go through the post processor.

* Added support for pretokenized input pairs on encode_plus

* Added is_pretokenized flag on encode_plus for clarity and improved error message on input TypeError.

* Added pretokenized inputs support on batch_encode_plus

* Update BatchEncoding methods name to match Encoding.

* Bump setup.py tokenizers dependency to 0.7.0rc1

* Remove unused parameters in BertTokenizerFast

* Make sure Roberta returns token_type_ids for unittests.

* Added missing typings

* Update add_tokens prototype to match tokenizers side and allow AddedToken

* Bumping tokenizers to 0.7.0rc2

* Added documentation for BatchEncoding

* Added (unused) is_pretokenized parameter on PreTrainedTokenizer encode_plus/batch_encode_plus methods.

* Added higher-level typing for tokenize / encode_plus / batch_encode_plus.

* Fix unittests failing because add_special_tokens was defined as a constructor parameter on Rust Tokenizers.

* Fix text-classification pipeline using the wrong tokenizer

* Make pipelines work with BatchEncoding

* Turn off add_special_tokens on tokenize by default.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Remove add_prefix_space from tokenize call in unittest.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Style and quality
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Correct message for batch_encode_plus none input exception.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Fix invalid list comprehension for offset_mapping overriding content every iteration.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* TransfoXL uses Strip normalizer.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Bump tokenizers dependency to 0.7.0rc3
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Support AddedTokens for special_tokens and use left stripping on mask for Roberta.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* SpecialTokensMixin can use slots for faster access to underlying attributes.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Remove update_special_tokens from fast tokenizers.

* Ensure TransfoXL unittests are run only when torch is available.

* Style.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Style

* Style 🙏🙏

* Remove slots on SpecialTokensMixin, need deep dive into pickle protocol.

* Remove Roberta warning on __init__.

* Move documentation to Google style.
Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>


17 Jan, 2020 1 commit
- Fix BasicTokenizer to respect `never_split` parameters (#2557) · 65a89a89
  Mark Neumann authored Jan 17, 2020
```
* add failing test

* fix call to _run_split_on_punc

* format with black
```
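
The commit body above mentions only a failing test and a fixed call to `_run_split_on_punc`. As an illustration of what "respecting `never_split`" can mean, here is a self-contained sketch, not the library's implementation: the function names `split_on_punc` and `basic_tokenize` are hypothetical, and the punctuation rule (ASCII symbol ranges plus Unicode "P" categories) is an assumption.

```python
import unicodedata


def is_punctuation(ch: str) -> bool:
    """Treat ASCII symbols and Unicode punctuation categories as punctuation."""
    cp = ord(ch)
    if 33 <= cp <= 47 or 58 <= cp <= 64 or 91 <= cp <= 96 or 123 <= cp <= 126:
        return True
    return unicodedata.category(ch).startswith("P")


def split_on_punc(token: str, never_split=None) -> list:
    """Split a token around punctuation unless it is listed in never_split."""
    if never_split is not None and token in never_split:
        return [token]  # protected tokens such as "[UNK]" stay whole
    pieces, current = [], ""
    for ch in token:
        if is_punctuation(ch):
            if current:
                pieces.append(current)
                current = ""
            pieces.append(ch)  # each punctuation character becomes its own piece
        else:
            current += ch
    if current:
        pieces.append(current)
    return pieces


def basic_tokenize(text: str, never_split=None) -> list:
    """Whitespace-split, then punctuation-split, forwarding never_split."""
    output = []
    for token in text.split():
        # Forgetting to pass never_split here would break protected tokens apart.
        output.extend(split_on_punc(token, never_split))
    return output


print(basic_tokenize("hello [UNK] world!", never_split=["[UNK]"]))
print(basic_tokenize("hello [UNK] world!"))
```

The second call shows the behavior a regression test would catch: without the protected list, `[UNK]` is broken into brackets and letters.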
15 Jan, 2020 1 commit
- 💄 super · 83a41d39
  Julien Chaumond authored Jan 15, 2020
  
06 Jan, 2020 2 commits
- GPU text generation: Moved the encoded_prompt to correct device · 81d6841b
  alberduris authored Dec 31, 2019
  
- Moved the encoded_prompts to correct device · dd4df80f
  alberduris authored Dec 31, 2019
  
05 Jan, 2020 1 commit
- Enforce target version for black. · 0ffc8eaf
  Aymeric Augustin authored Dec 27, 2019
```
This should stabilize formatting.
```
24 Dec, 2019 1 commit
- Add tests for fast tokenizers · 2818e505
  Anthony MOI authored Dec 24, 2019
  
22 Dec, 2019 8 commits
- Use built-in open(). · 1c62e87b
  Aymeric Augustin authored Dec 22, 2019
```
On Python 3, `open is io.open`.
```
- Remove __future__ imports. · c824d15a
  Aymeric Augustin authored Dec 22, 2019
  
- Replace CommonTestCases for tokenizers with a mixin. · 00204f2b
  Aymeric Augustin authored Dec 22, 2019
```
This is the same change as for (TF)CommonTestCases for modeling.
```
- Rename file for consistency. · a3c5883f
  Aymeric Augustin authored Dec 22, 2019
  
- Remove unittest.main() in test modules. · 7e98e211
  Aymeric Augustin authored Dec 22, 2019
```
This construct isn't used anymore these days.

Running python tests/test_foo.py puts the tests/ directory on
PYTHONPATH, which isn't representative of how we run tests.

Use python -m unittest tests/test_foo.py instead.
```
- Switch test files to the standard test_*.py scheme. · ced0a942
  Aymeric Augustin authored Dec 22, 2019
  
- Move tests outside of library. · 067395d5
  Aymeric Augustin authored Dec 22, 2019
  
- Sort imports with isort. · 158e82e0
  Aymeric Augustin authored Dec 21, 2019
```
This is the result of:

    $ isort --recursive examples templates transformers utils hubconf.py setup.py
```
21 Dec, 2019 1 commit

Reformat source code with black. · fa84ae26

Aymeric Augustin authored Dec 21, 2019

This is the result of:

    $ black --line-length 119 examples templates transformers utils hubconf.py setup.py

There's a lot of fairly long lines in the project. As a consequence, I'm
picking the longest widely accepted line length, 119 characters.

This is also Thomas' preference, because it allows for explicit variable
names, to make the code easier to understand.


13 Dec, 2019 5 commits
- Tests for all tokenizers · c3248cf1
  LysandreJik authored Dec 11, 2019
  
- better for python2.x · f2ac50cb
  Pascal Voitot authored Dec 10, 2019
  
- missed space · 4cbdc7d9
  Pascal Voitot authored Dec 10, 2019
  
- more tests · dd2add9f
  Pascal Voitot authored Dec 10, 2019
  
- 🐛 #2096 in tokenizer.decode, space is not joined between all subtexts... · df160af7
  Pascal Voitot authored Dec 10, 2019
```
🐛 #2096 in tokenizer.decode, space is not joined between all subtexts instead of before added tokens
```
06 Dec, 2019 1 commit

Remove dependency on pytest for running tests (#2055) · 35401fe5

Aymeric Augustin authored Dec 06, 2019

* Switch to plain unittest for skipping slow tests.

Add a RUN_SLOW environment variable for running them.

* Switch to plain unittest for PyTorch dependency.

* Switch to plain unittest for TensorFlow dependency.

* Avoid leaking open files in the test suite.

This prevents spurious warnings when running tests.

* Fix unicode warning on Python 2 when running tests.

The warning was:

    UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal

* Support running PyTorch tests on a GPU.

Reverts 27e015bd.

* Tests no longer require pytest.

* Make tests pass on cuda
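
The first bullet of this commit describes skipping slow tests with plain unittest, controlled by a `RUN_SLOW` environment variable. A minimal sketch of that idea follows. The helper names `parse_flag_from_env` and `slow`, and the set of accepted truthy strings, are assumptions made for illustration; only the `RUN_SLOW` variable name comes from the commit message.

```python
import os
import unittest


def parse_flag_from_env(key: str, default: bool = False) -> bool:
    """Read a boolean flag from the environment (accepted spellings are an assumption)."""
    value = os.environ.get(key)
    if value is None:
        return default
    return value.strip().lower() in {"1", "true", "yes", "y", "on"}


def slow(test_case):
    """Decorator: skip the test unless RUN_SLOW is set to a truthy value.

    The flag is read when the decorator is applied, i.e. at import time.
    """
    return unittest.skipUnless(parse_flag_from_env("RUN_SLOW"), "test is slow")(test_case)


class ExampleTest(unittest.TestCase):
    def test_fast_path(self):
        self.assertEqual(1 + 1, 2)

    @slow
    def test_downloads_large_model(self):
        # Placeholder for an expensive test; runs only when RUN_SLOW is enabled.
        self.assertTrue(True)
```

With this sketch saved as a test module, running `python -m unittest` would report the decorated test as skipped, while `RUN_SLOW=1 python -m unittest` would run it, with no dependency on pytest markers.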


04 Nov, 2019 1 commit
- fix tests - flagged as slow all the tests downloading from AWS · b340a910
  thomwolf authored Nov 04, 2019
  
22 Oct, 2019 1 commit
- Remove · 7d709e55
  Lysandre authored Oct 22, 2019
  
04 Oct, 2019 1 commit
- update encode_plus - add truncation strategies · 6c1d0bc0
  thomwolf authored Oct 04, 2019
  
26 Sep, 2019 1 commit
- [BIG] pytorch-transformers => transformers · 31c23bd5
  thomwolf authored Sep 26, 2019
  
19 Sep, 2019 1 commit
- Sentence -> Sequence. Removed output_mask from the special token addition methods. · bf503158
  LysandreJik authored Sep 19, 2019
  
30 Aug, 2019 1 commit
- added test and debug tokenizer configuration serialization · 69da972a
  thomwolf authored Aug 30, 2019
  
28 Aug, 2019 1 commit
- add dilbert tokenizer and tests · 62df4ba5
  thomwolf authored Aug 28, 2019
  
12 Aug, 2019 1 commit
- Added integration tests for sequence builders. · 634a3172
  LysandreJik authored Aug 12, 2019
  
05 Aug, 2019 1 commit
- cleaning up tokenizer tests structure (at last) - last remaining ppb refs · 328afb70
  thomwolf authored Aug 05, 2019
  
15 Jul, 2019 1 commit
- update tokenizer - update squad example for xlnet · 15d8b126
  thomwolf authored Jul 15, 2019
  
09 Jul, 2019 2 commits
- fix python 2 tests · c079d7dd
  thomwolf authored Jul 09, 2019
  
- unified tokenizer api and serialization + tests · b1978698
  thomwolf authored Jul 09, 2019
  
05 Jul, 2019 2 commits
- tokenization abstract class - tests for examples · 36bca545
  thomwolf authored Jul 05, 2019
  
- [BIG] name change · 0bab55d5
  thomwolf authored Jul 05, 2019
  