"test/git@developer.sourcefind.cn:gaoqiong/migraphx.git" did not exist on "5ca9c2547769ff5d953124885539f97f2aa6a887"
  1. 23 Jun, 2020 1 commit
    • Tokenizers API developments (#5103) · 11fdde02
      Thomas Wolf authored
      
      
      * Add return lengths
      
      * make pad a bit more flexible so it can be used as collate_fn
      
      * check all kwargs sent to encoding method are known
      
      * fixing kwargs in encodings
      
      * New AddedToken class in python
      
      This class lets you specify specific tokenization behaviors for some special tokens. It is used in particular for GPT2 and Roberta, to control how whitespace is stripped around special tokens. (A minimal sketch follows this entry.)
      
      * style and quality
      
      * switched to the huggingface tokenizers library for AddedTokens
      
      * up to tokenizer 0.8.0-rc3 - update API to use AddedToken state
      
      * style and quality
      
      * do not raise an error on additional or unused kwargs for tokenize() but only a warning
      
      * transfo-xl pretrained model requires torch
      
      * Update src/transformers/tokenization_utils.py
      Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
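      A minimal sketch of the new AddedToken behavior described above, assuming the tokenizers 0.8.0-rc3 API pinned in this PR; the token string and model name are illustrative:

          from tokenizers import AddedToken
          from transformers import RobertaTokenizer

          tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
          # Strip whitespace on the left of the new special token, keep it on the right.
          special = AddedToken("<special>", lstrip=True, rstrip=False)
          tokenizer.add_tokens([special], special_tokens=True)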
  2. 18 Jun, 2020 1 commit
  3. 15 Jun, 2020 1 commit
    • [HUGE] Refactoring tokenizers backend - padding - truncation - pre-tokenized... · 36434220
      Anthony MOI authored
      
      [HUGE] Refactoring tokenizers backend - padding - truncation - pre-tokenized pipeline - fast tokenizers - tests (#4510)
      
      * Use tokenizers pre-tokenized pipeline
      
      * failing pretokenized test
      
      * Fix is_pretokenized in python
      
      * add pretokenized tests
      
      * style and quality
      
      * better tests for batched pretokenized inputs
      
      * tokenizers clean up - new padding_strategy - split the files
      
      * [HUGE] refactoring tokenizers - padding - truncation - tests
      
      * style and quality
      
      * bump up required tokenizers version to 0.8.0-rc1
      
      * switched padding/truncation API - simpler, better backward compat (a usage sketch follows this entry)
      
      * updating tests for custom tokenizers
      
      * style and quality - tests on pad
      
      * fix QA pipeline
      
      * fix backward compatibility for max_length only
      
      * style and quality
      
      * Various cleans up - add verbose
      
      * fix tests
      
      * update docstrings
      
      * Fix tests
      
      * Docs reformatted
      
      * __call__ method documented
      Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
      Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
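      A minimal sketch of the reworked padding/truncation API, using the __call__ entry point documented in this PR; the model name and texts are illustrative:

          from transformers import BertTokenizerFast

          tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
          batch = tokenizer(
              ["a short example", "a somewhat longer example sentence"],
              padding=True,     # pad to the longest sequence in the batch
              truncation=True,  # truncate to the model's maximum length
          )
          print(batch["input_ids"])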
  4. 09 Jun, 2020 1 commit
    • [Benchmark] add tpu and torchscript for benchmark (#4850) · 2cfb947f
      Patrick von Platen authored
      
      
      * add tpu and torchscript for benchmark
      
      * fix name in tests
      
      * "fix email"
      
      * make style
      
      * better log message for tpu
      
      * add more print and info for tpu
      
      * allow possibility to print tpu metrics
      
      * correct cpu usage
      
      * fix test for non-install
      
      * remove bogus file
      
      * include psutil in testing
      
      * run a couple of times before tracing in torchscript (see the sketch after this entry)
      
      * do not allow tpu memory tracing for now
      
      * make style
      
      * add torchscript to env
      
      * better name for torch tpu
      Co-authored-by: Patrick von Platen <patrick@huggingface.co>
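      A minimal sketch of the warm-up-before-tracing idea mentioned above; the helper name, model, and input are illustrative and assume torch is installed:

          import torch

          def trace_after_warmup(model, example_input, warmup_runs=3):
              # Run a few forward passes first so tracing sees a warmed-up model.
              model.eval()
              with torch.no_grad():
                  for _ in range(warmup_runs):
                      model(example_input)
              return torch.jit.trace(model, example_input)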
  5. 02 Jun, 2020 2 commits
  6. 26 May, 2020 1 commit
    • Make transformers-cli cross-platform (#4131) · 8cc6807e
      Bram Vanroy authored
      * make transformers-cli cross-platform
      
      Using "scripts" is a useful option in setup.py particularly when you want to get access to non-python scripts. However, in this case we want to have an entry point into some of our own Python scripts. To do this in a concise, cross-platfom way, we can use entry_points.console_scripts. This change is necessary to provide the CLI on different platforms, which "scripts" does not ensure. Usage remains the same, but the "transformers-cli" script has to be moved (be part of the library) and renamed (underscore + extension)
      
      * make style & quality
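      A minimal sketch of the console_scripts approach described above; the exact module path for main() is illustrative:

          from setuptools import setup

          setup(
              name="transformers",
              entry_points={
                  "console_scripts": [
                      # setuptools generates a platform-appropriate
                      # "transformers-cli" executable wrapping main().
                      "transformers-cli=transformers.commands.transformers_cli:main",
                  ]
              },
          )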
  7. 22 May, 2020 3 commits
  8. 14 May, 2020 3 commits
    • Conversion script to export transformers models to ONNX IR. (#4253) · db0076a9
      Funtowicz Morgan authored
      * Added generic ONNX conversion script for PyTorch model (a usage sketch follows this entry).
      
      * WIP initial TF support.
      
      * TensorFlow/Keras ONNX export working.
      
      * Print framework version info
      
      * Add possibility to check the model is correctly loading on ONNX runtime.
      
      * Remove quantization option.
      
      * Specify ONNX opset version when exporting.
      
      * Formatting.
      
      * Remove unused imports.
      
      * Make functions more generally reusable from other part of the code.
      
      * isort happy.
      
      * flake happy
      
      * Export only feature-extraction for now
      
      * Correctly check inputs order / filter before export.
      
      * Removed task variable
      
      * Fix invalid args call in load_graph_from_args.
      
      * Fix invalid args call in convert.
      
      * Fix invalid args call in infer_shapes.
      
      * Raise exception and catch in caller function instead of exit.
      
      * Add 04-onnx-export.ipynb notebook
      
      * More WIP on the notebook
      
      * Remove unused imports
      
      * Simplify & remove unused constants.
      
      * Export with constant_folding in PyTorch
      
      * Let's try to put function args in the right order this time ...
      
      * Disable external_data_format temporary
      
      * ONNX notebook draft ready.
      
      * Updated notebooks charts + wording
      
      * Correct error while exporting last chart in notebook.
      
      * Addressing @LysandreJik comment.
      
      * Set ONNX opset to 11 as default value.
      
      * Set opset param mandatory
      
      * Added ONNX export unittests
      
      * Quality.
      
      * flake8 happy
      
      * Add keras2onnx dependency on extras["tf"]
      
      * Pin keras2onnx on github master to v1.6.5
      
      * Second attempt.
      
      * Third attempt.
      
      * Use the right repo URL this time ...
      
      * Do the same for onnxconverter-common
      
      * Added keras2onnx and onnxconverter-common at 1.7.0 to support TF2.2
      
      * Correct commit hash.
      
      * Addressing PR review: Optimization are enabled by default.
      
      * Addressing PR review: small changes in the notebook
      
      * setup.py comment about keras2onnx versioning.
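      A minimal sketch of invoking the conversion script added here, assuming the convert() helper it exposes and the opset-11 default set above; the model name, output path, and argument types are illustrative:

          from transformers.convert_graph_to_onnx import convert

          # Export the PyTorch ("pt") variant of the model to ONNX opset 11.
          convert(framework="pt", model="bert-base-cased",
                  output="onnx/bert-base-cased.onnx", opset=11)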
    • Fix: unpin flake8 and fix cs errors (#4367) · 448c4672
      Julien Chaumond authored
      * Fix: unpin flake8 and fix cs errors
      
      * Ok we still need to quote those
    • [ci skip] Pin isort · 015f7812
      Julien Chaumond authored
  9. 13 May, 2020 1 commit
  10. 12 May, 2020 2 commits
  11. 11 May, 2020 1 commit
  12. 07 May, 2020 2 commits
  13. 05 May, 2020 1 commit
    • Pytorch 1.5.0 (#3973) · 79b1c696
      Lysandre Debut authored
      * Standard deviation can no longer be set to 0
      
      * Remove torch pinned version
      
      * 9th instead of 10th, silly me
  14. 01 May, 2020 1 commit
  15. 27 Apr, 2020 1 commit
  16. 22 Apr, 2020 1 commit
  17. 21 Apr, 2020 1 commit
  18. 18 Apr, 2020 1 commit
    • Cleanup fast tokenizers integration (#3706) · 827d6d6e
      Thomas Wolf authored
      
      
      * First pass on utility classes and python tokenizers
      
      * finishing cleanup pass
      
      * style and quality
      
      * Fix tests
      
      * Updating following @mfuntowicz comment
      
      * style and quality
      
      * Fix Roberta
      
      * fix batch_size/seq_length in BatchEncoding
      
      * add alignment methods + tests (a sketch follows this entry)
      
      * Fix OpenAI and Transfo-XL tokenizers
      
      * adding trim_offsets=True default for GPT2 and RoBERTa
      
      * style and quality
      
      * fix tests
      
      * add_prefix_space in roberta
      
      * bump up tokenizers to rc7
      
      * style
      
      * unfortunately TensorFlow does not like these - removing shape/seq_len for now
      
      * Update src/transformers/tokenization_utils.py
      Co-Authored-By: Stefan Schweter <stefan@schweter.it>
      
      * Adding doc and docstrings
      
      * making flake8 happy
      Co-authored-by: Stefan Schweter <stefan@schweter.it>
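      A minimal sketch of the alignment methods added to BatchEncoding, assuming a fast tokenizer; the model name, text, and character index are illustrative:

          from transformers import BertTokenizerFast

          tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
          encoding = tokenizer.encode_plus("Hello world")
          # Map a character position back to its token, then recover the
          # character span that token covers.
          token_index = encoding.char_to_token(6)
          print(token_index, encoding.token_to_chars(token_index))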
  19. 10 Apr, 2020 1 commit
  20. 06 Apr, 2020 4 commits
    • Tokenizers v3.0.0 (#3185) · 96ab75b8
      Funtowicz Morgan authored
      
      
      * Renamed num_added_tokens to num_special_tokens_to_add
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

      * Cherry-Pick: Partially fix space only input without special tokens added to the output #3091
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

      * Added property is_fast on PretrainedTokenizer and PretrainedTokenizerFast
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Make fast tokenizers unittests work on Windows.
      
      * Entirely refactored unittest for tokenizers fast.
      
      * Remove ABC class for CommonFastTokenizerTest
      
      * Added embeded_special_tokens tests from allenai @dirkgr
      
      * Make embeded_special_tokens tests from allenai more generic
      
      * Uniformize vocab_size as a property for both Fast and normal tokenizers
      
      * Move special tokens handling out of PretrainedTokenizer (SpecialTokensMixin)
      
      * Ensure providing None input raises the same ValueError as the Python tokenizer + tests.
      
      * Fix invalid input for assert_padding when testing batch_encode_plus
      
      * Move add_special_tokens from constructor to tokenize/encode/[batch_]encode_plus methods parameter.
      
      * Ensure tokenize() correctly forward add_special_tokens to rust.
      
      * Adding None checking on top of encode / encode_batch for TransfoXLTokenizerFast.
      Avoid stripping on None values.
      
      * unittests ensure tokenize() also throws a ValueError if provided None
      
      * Added add_special_tokens unittest for all supported models.
      
      * Style
      
      * Make sure TransfoXL test run only if PyTorch is provided.
      
      * Split up tokenizers tests for each model type.
      
      * Fix invalid unittest with new tokenizers API.
      
      * Filter out Roberta openai detector models from unittests.
      
      * Introduce BatchEncoding on fast tokenizers path.
      
      This new structure exposes all the mappings retrieved from Rust.
      It also keeps the current behavior with model forward.
      
      * Introduce BatchEncoding on slow tokenizers path.
      
      Backward compatibility.
      
      * Improve error message on BatchEncoding for slow path
      
      * Make add_prefix_space True by default on Roberta fast to match Python in majority of cases.
      
      * Style and format.
      
      * Added typing on all methods for PretrainedTokenizerFast
      
      * Style and format
      
      * Added path for feeding pretokenized (List[str]) input to PretrainedTokenizerFast.
      
      * Style and format
      
      * encode_plus now supports pretokenized inputs.
      
      * Remove user warning about add_special_tokens when working on pretokenized inputs.
      
      * Always go through the post processor.
      
      * Added support for pretokenized input pairs on encode_plus
      
      * Added is_pretokenized flag on encode_plus for clarity and improved error message on input TypeError (a sketch follows this entry).
      
      * Added pretokenized inputs support on batch_encode_plus
      
      * Update BatchEncoding methods name to match Encoding.
      
      * Bump setup.py tokenizers dependency to 0.7.0rc1
      
      * Remove unused parameters in BertTokenizerFast
      
      * Make sure Roberta returns token_type_ids for unittests.
      
      * Added missing typings
      
      * Update add_tokens prototype to match tokenizers side and allow AddedToken
      
      * Bumping tokenizers to 0.7.0rc2
      
      * Added documentation for BatchEncoding
      
      * Added (unused) is_pretokenized parameter on PreTrainedTokenizer encode_plus/batch_encode_plus methods.
      
      * Added higher-level typing for tokenize / encode_plus / batch_encode_plus.
      
      * Fix unittests failing because add_special_tokens was defined as a constructor parameter on Rust Tokenizers.
      
      * Fix text-classification pipeline using the wrong tokenizer
      
      * Make pipelines work with BatchEncoding
      
      * Turn off add_special_tokens on tokenize by default.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

      * Remove add_prefix_space from tokenize call in unittest.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

      * Style and quality
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

      * Correct message for batch_encode_plus none input exception.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

      * Fix invalid list comprehension for offset_mapping overriding content every iteration.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

      * TransfoXL uses Strip normalizer.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

      * Bump tokenizers dependency to 0.7.0rc3
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

      * Support AddedTokens for special_tokens and use left stripping on mask for Roberta.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

      * SpecialTokensMixin can use slots for faster access to underlying attributes.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Remove update_special_tokens from fast tokenizers.
      
      * Ensure TransfoXL unittests are run only when torch is available.
      
      * Style.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Style
      
      * Style 🙏🙏

      * Remove slots on SpecialTokensMixin, need deep dive into pickle protocol.
      
      * Remove Roberta warning on __init__.
      
      * Move documentation to Google style.
      Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>
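      A minimal sketch of the pretokenized path described above, assuming the is_pretokenized flag on encode_plus from this release; the model name and tokens are illustrative:

          from transformers import BertTokenizerFast

          tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
          # Input is already split into words, so pre-tokenization is skipped.
          encoding = tokenizer.encode_plus(["Hello", "world", "!"], is_pretokenized=True)
          print(encoding["input_ids"])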
    • Re-pin isort · ea6dba27
      LysandreJik authored
    • unpin isort for pypi · 11c3257a
      LysandreJik authored
    • Release: v2.8.0 · 36bffc81
      LysandreJik authored
  21. 30 Mar, 2020 3 commits
  22. 26 Mar, 2020 1 commit
  23. 25 Mar, 2020 1 commit
    • Experiment w/ dataclasses (including Py36) (#3423) · 83272a38
      Julien Chaumond authored
      * [ci] Also run test_examples in py37
      
      (will revert at the end of the experiment)
      
      * InputExample: use immutable dataclass (a sketch follows this entry)
      
      * [deps] Install dataclasses for Py<3.7
      
      * [skip ci] Revert "[ci] Also run test_examples in py37"
      
      This reverts commit d29afd9959786b77759b0b8fa4e6b4335b952015.
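      A minimal sketch of the immutable-dataclass change, mirroring InputExample's fields; frozen=True is an assumption about how immutability is enforced:

          from dataclasses import dataclass
          from typing import Optional

          @dataclass(frozen=True)
          class InputExample:
              guid: str
              text_a: str
              text_b: Optional[str] = None
              label: Optional[str] = None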
  24. 24 Mar, 2020 2 commits
  25. 23 Mar, 2020 3 commits