Commits · 074340339a6d6aede30c14c94ffe7b59a01786f1 · chenpangpang / transformers

24 Aug, 2020 1 commit
- Update repo to isort v5 (#6686) · a5737779
  Sylvain Gugger authored Aug 24, 2020
```
* Run new isort

* More changes

* Update CI, CONTRIBUTING and benchmarks
```
  a5737779
17 Aug, 2020 1 commit

Support additional dictionaries for BERT Japanese tokenizers (#6515) · 48c6c613

Masatoshi Suzuki authored Aug 17, 2020

* Update BERT Japanese tokenizers

* Update CircleCI config to download unidic

* Specify to use the latest dictionary packages

48c6c613

31 Jul, 2020 1 commit

Replace mecab-python3 with fugashi for Japanese tokenization (#6086) · cf3cf304

Paul O'Leary McCann authored Jul 31, 2020



* Replace mecab-python3 with fugashi

This replaces mecab-python3 with fugashi for Japanese tokenization. I am
the maintainer of both projects.

Both projects are MeCab wrappers, so the underlying C++ code is the
same. fugashi is the newer wrapper and doesn't use SWIG, so for basic
use of the MeCab API it's easier to use.

This code ensures the use of a version of ipadic installed via pip,
which should make versioning and tracking down issues easier.

fugashi has wheels for Windows, OSX, and Linux, which will help with
issues with installing old versions of mecab-python3 on Windows.
Compared to mecab-python3, because fugashi doesn't use SWIG, it doesn't
require a C++ runtime to be installed on Windows.

In adding this change I removed some code dealing with `cursor`,
`token_start`, and `token_end` variables. These variables didn't seem to
be used for anything; it is unclear to me why they were there.

I ran the tests and they passed, though I couldn't figure out how to run
the slow tests (`--runslow` gave an error) and didn't try testing with
TensorFlow.

* Style fix

* Remove unused variable

Forgot to delete this...

* Adapt doc with install instructions

* Fix typo
Co-authored-by: sgugger <sylvain.gugger@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

cf3cf304
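
The tokenization path described in this commit can be sketched roughly as follows. This is a minimal illustration, assuming `fugashi` and `ipadic` are both installed via pip; the helper names `build_ipadic_tagger` and `surface_tokens` are hypothetical and are not taken from the transformers source.

```python
from typing import Callable, Iterable, List


def build_ipadic_tagger():
    """Create a fugashi tagger backed by the pip-installed ipadic dictionary.

    The imports are local so the rest of this sketch can be exercised with a
    stand-in tagger on a machine without fugashi or ipadic installed.
    """
    import fugashi
    import ipadic

    # ipadic.MECAB_ARGS holds the MeCab options that point at the bundled dictionary.
    return fugashi.GenericTagger(ipadic.MECAB_ARGS)


def surface_tokens(text: str, tagger: Callable[[str], Iterable]) -> List[str]:
    """Return the surface form of every node the tagger yields for `text`."""
    return [word.surface for word in tagger(text)]
```

With the real packages available, `surface_tokens(text, build_ipadic_tagger())` would return the MeCab segmentation of `text`. Because the dictionary comes from a pip package, its version can be tracked like any other dependency.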

29 Jul, 2020 1 commit
- Fix TF CTRL model naming (#6134) · fc64559c
  Julien Plu authored Jul 29, 2020
  
  fc64559c
27 Jul, 2020 1 commit
- Pin TF while we wait for a fix · 9d0d3a66
  sgugger authored Jul 27, 2020
  
  9d0d3a66
18 Jul, 2020 1 commit
- Update tokenizers to 0.8.1.rc to fix Mac OS X issues (#5867) · eae6d8d1
  Sebastian authored Jul 18, 2020
  
  eae6d8d1
06 Jul, 2020 3 commits
- Post v3.0.2 release commit · 1d233286
  Lysandre authored Jul 06, 2020
  
  1d233286
- Release: v3.0.2 · b0892fa0
  Lysandre authored Jul 06, 2020
  
  b0892fa0
- Various tokenizers fixes (#5558) · 5787e4c1
  Anthony MOI authored Jul 06, 2020
```
* BertTokenizerFast - Do not specify strip_accents by default

* Bump tokenizers to new version

* Add test for AddedToken serialization
```
  5787e4c1
03 Jul, 2020 2 commits
- unpinning specific git versions in setup.py · b58a15a3
  Thomas Wolf authored Jul 03, 2020
  
  b58a15a3
- Release: 3.0.1 · fedabcd1
  Thomas Wolf authored Jul 03, 2020
  
  fedabcd1
02 Jul, 2020 1 commit
- Bans SentencePiece 0.1.92 (#5418) · 69d313e8
  Lysandre Debut authored Jul 02, 2020
  
  69d313e8
30 Jun, 2020 1 commit
- Repin versions · 90d13954
  Lysandre authored Jun 30, 2020
  
  90d13954
29 Jun, 2020 2 commits
- Release: v3.0.0 · b62ca595
  Lysandre authored Jun 29, 2020
  
  b62ca595
- Pin mecab for now (#5362) · 482c9178
  Sylvain Gugger authored Jun 29, 2020
  
  482c9178
25 Jun, 2020 1 commit

Refactor Code samples; Test code samples (#5036) · 364a5ae1

Lysandre Debut authored Jun 25, 2020



* Refactor code samples

* Test docstrings

* Style

* Tokenization examples

* Run rest of tests

* First step to testing source docs

* Style and BART comment

* Test the remainder of the code samples

* Style

* let to const

* Formatting fixes

* Ready for merge

* Fix fixture + Style

* Fix last tests

* Update docs/source/quicktour.rst
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Addressing @sgugger's comments + Fix MobileBERT in TF
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

364a5ae1

23 Jun, 2020 1 commit

Tokenizers API developments (#5103) · 11fdde02

Thomas Wolf authored Jun 23, 2020



* Add return lengths

* make pad a bit more flexible so it can be used as collate_fn

* check all kwargs sent to encoding method are known

* fixing kwargs in encodings

* New AddedToken class in python

This class lets you specify specific tokenization behaviors for some special tokens. It is used in particular for GPT2 and Roberta, to control how whitespace is stripped around special tokens.

* style and quality

* switched to huggingface tokenizers library for AddedTokens

* up to tokenizer 0.8.0-rc3 - update API to use AddedToken state

* style and quality

* do not raise an error on additional or unused kwargs for tokenize() but only a warning

* transfo-xl pretrained model requires torch

* Update src/transformers/tokenization_utils.py
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

11fdde02
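
The whitespace-stripping behavior mentioned for the AddedToken class can be illustrated with a small pure-Python sketch. This is not the library's implementation: `SketchAddedToken` and `split_on_token` are hypothetical names, and the regex-based splitting is only an assumption about what "stripping whitespace around a special token" means in practice.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SketchAddedToken:
    """Hypothetical stand-in for an added token with strip options."""

    content: str
    lstrip: bool = False  # absorb whitespace to the left of the token
    rstrip: bool = False  # absorb whitespace to the right of the token


def split_on_token(text: str, token: SketchAddedToken) -> List[str]:
    """Split `text` around `token`, dropping whitespace the token absorbs."""
    left = r"\s*" if token.lstrip else ""
    right = r"\s*" if token.rstrip else ""
    pattern = re.compile(f"{left}({re.escape(token.content)}){right}")

    pieces: List[str] = []
    last = 0
    for match in pattern.finditer(text):
        if match.start() > last:
            pieces.append(text[last:match.start()])
        pieces.append(match.group(1))  # the token itself, without whitespace
        last = match.end()
    if last < len(text):
        pieces.append(text[last:])
    return pieces
```

For example, with `lstrip=True` the space before `<mask>` in `"Hello <mask> world"` is absorbed by the token, so the left-hand piece is `"Hello"` rather than `"Hello "`.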

18 Jun, 2020 1 commit
- Pin `sphinx-rtd-theme` (#5128) · 97343326
  Lysandre Debut authored Jun 18, 2020
  
  97343326
15 Jun, 2020 1 commit

[HUGE] Refactoring tokenizers backend - padding - truncation - pre-tokenized... · 36434220

Anthony MOI authored Jun 15, 2020


[HUGE] Refactoring tokenizers backend - padding - truncation - pre-tokenized pipeline - fast tokenizers - tests (#4510)

* Use tokenizers pre-tokenized pipeline

* failing pretokenized test

* Fix is_pretokenized in python

* add pretokenized tests

* style and quality

* better tests for batched pretokenized inputs

* tokenizers clean up - new padding_strategy - split the files

* [HUGE] refactoring tokenizers - padding - truncation - tests

* style and quality

* bump up required tokenizers version to 0.8.0-rc1

* switched padding/truncation API - simpler better backward compat

* updating tests for custom tokenizers

* style and quality - tests on pad

* fix QA pipeline

* fix backward compatibility for max_length only

* style and quality

* Various cleanups - add verbose

* fix tests

* update docstrings

* Fix tests

* Docs reformatted

* __call__ method documented
Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>

36434220
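
The padding and truncation options this refactoring deals with can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the library's code: the function name `pad_batch` is hypothetical, and the strategy strings (`"longest"`, `"max_length"`, `"do_not_pad"`) are assumed here for the sake of the example.

```python
from typing import Dict, List, Optional, Union


def pad_batch(
    batch: List[List[int]],
    padding: Union[bool, str] = "longest",
    truncation: bool = False,
    max_length: Optional[int] = None,
    pad_id: int = 0,
) -> Dict[str, List[List[int]]]:
    """Truncate, then right-pad, a batch of token id sequences."""
    if truncation:
        if max_length is None:
            raise ValueError("truncation requires max_length")
        batch = [ids[:max_length] for ids in batch]

    # Decide the target length from the padding strategy.
    if padding is True or padding == "longest":
        target = max((len(ids) for ids in batch), default=0)
    elif padding == "max_length":
        if max_length is None:
            raise ValueError("padding='max_length' requires max_length")
        target = max_length
    elif padding is False or padding == "do_not_pad":
        target = None
    else:
        raise ValueError(f"unknown padding strategy: {padding!r}")

    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = 0 if target is None else max(target - len(ids), 0)
        input_ids.append(list(ids) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}
```

Keeping padding and truncation as separate arguments means a caller who only passes `max_length` with `truncation=True` gets truncation without being forced into a particular padding strategy.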

09 Jun, 2020 1 commit

[Benchmark] add tpu and torchscript for benchmark (#4850) · 2cfb947f

Patrick von Platen authored Jun 09, 2020



* add tpu and torchscript for benchmark

* fix name in tests

* "fix email"

* make style

* better log message for tpu

* add more print and info for tpu

* allow possibility to print tpu metrics

* correct cpu usage

* fix test for non-install

* remove bogus file

* include psutil in testing

* run a couple of times before tracing in torchscript

* do not allow tpu memory tracing for now

* make style

* add torchscript to env

* better name for torch tpu
Co-authored-by: Patrick von Platen <patrick@huggingface.co>

2cfb947f

02 Jun, 2020 2 commits
- Repin versions · d976ef26
  Lysandre authored Jun 02, 2020
  
  d976ef26
- Release: v2.11.0 · b43c78e5
  Lysandre authored Jun 02, 2020
  
  b43c78e5
26 May, 2020 1 commit

Make transformers-cli cross-platform (#4131) · 8cc6807e

Bram Vanroy authored May 26, 2020

* make transformers-cli cross-platform

Using "scripts" is a useful option in setup.py, particularly when you want to get access to non-Python scripts. However, in this case we want an entry point into some of our own Python scripts. To do this in a concise, cross-platform way, we can use entry_points.console_scripts. This change is necessary to provide the CLI on different platforms, which "scripts" does not ensure. Usage remains the same, but the "transformers-cli" script has to be moved (to be part of the library) and renamed (underscore + extension).

* make style & quality

8cc6807e
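
The `entry_points.console_scripts` approach described in this commit can be sketched as below. The dictionary only mimics the keyword arguments one might pass to `setuptools.setup()`; the module path `transformers.commands.transformers_cli` and the `main` function name are assumptions inferred from the "moved and renamed" wording, and `parse_console_script` is a hypothetical helper added purely to show how the spec string is structured.

```python
from typing import Tuple

# Keyword arguments one might pass to setuptools.setup(); only the
# entry_points part matters here. The module path and function name
# are assumptions, not copied from the repository.
SETUP_KWARGS = {
    "name": "transformers",
    "entry_points": {
        "console_scripts": [
            "transformers-cli=transformers.commands.transformers_cli:main",
        ],
    },
}


def parse_console_script(spec: str) -> Tuple[str, str, str]:
    """Split 'command=package.module:function' into its three parts."""
    command, target = spec.split("=", 1)
    module, function = target.split(":", 1)
    return command.strip(), module.strip(), function.strip()
```

At install time the installer generates a launcher suited to the platform for each `console_scripts` entry, which is why the command name stays the same for users while the script itself becomes an importable module.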

22 May, 2020 3 commits
- Re-apply #4446 + add packaging dependency · 2c1ebb8b
  Julien Chaumond authored May 22, 2020
```
As discussed w/ @lysandrejik

packaging is maintained by PyPA (the Python Packaging Authority), and should be lightweight and stable
```
  2c1ebb8b
- Re-pin versions · ef22ba48
  Lysandre authored May 22, 2020
  
  ef22ba48
- Release: v2.10.0 · e0db6bbd
  Lysandre authored May 22, 2020
  
  e0db6bbd
14 May, 2020 3 commits

Conversion script to export transformers models to ONNX IR. (#4253) · db0076a9

Funtowicz Morgan authored May 14, 2020

* Added generic ONNX conversion script for PyTorch model.

* WIP initial TF support.

* TensorFlow/Keras ONNX export working.

* Print framework version info

* Add possibility to check the model is correctly loading on ONNX runtime.

* Remove quantization option.

* Specify ONNX opset version when exporting.

* Formatting.

* Remove unused imports.

* Make functions more generally reusable from other parts of the code.

* isort happy.

* flake happy

* Export only feature-extraction for now

* Correctly check inputs order / filter before export.

* Removed task variable

* Fix invalid args call in load_graph_from_args.

* Fix invalid args call in convert.

* Fix invalid args call in infer_shapes.

* Raise exception and catch in caller function instead of exit.

* Add 04-onnx-export.ipynb notebook

* More WIP on the notebook

* Remove unused imports

* Simplify & remove unused constants.

* Export with constant_folding in PyTorch

* Let's try to put function args in the right order this time ...

* Disable external_data_format temporarily

* ONNX notebook draft ready.

* Updated notebooks charts + wording

* Correct error while exporting last chart in notebook.

* Addressing @LysandreJik comment.

* Set ONNX opset to 11 as default value.

* Set opset param mandatory

* Added ONNX export unittests

* Quality.

* flake8 happy

* Add keras2onnx dependency on extras["tf"]

* Pin keras2onnx on github master to v1.6.5

* Second attempt.

* Third attempt.

* Use the right repo URL this time ...

* Do the same for onnxconverter-common

* Updated keras2onnx and onnxconverter-common to 1.7.0 to support TF2.2

* Correct commit hash.

* Addressing PR review: Optimizations are enabled by default.

* Addressing PR review: small changes in the notebook

* setup.py comment about keras2onnx versioning.

db0076a9

- Fix: unpin flake8 and fix cs errors (#4367) · 448c4672
  Julien Chaumond authored May 14, 2020
```
* Fix: unpin flake8 and fix cs errors

* Ok we still need to quote those
```
  448c4672
- [ci skip] Pin isort · 015f7812
  Julien Chaumond authored May 14, 2020
  
  015f7812

13 May, 2020 1 commit
- Release: v2.9.1 · 7cb203fa
  Lysandre authored May 13, 2020
  
  7cb203fa
12 May, 2020 2 commits

Allow BatchEncoding to be initialized empty. (#4316) · 7d7fe499

Funtowicz Morgan authored May 12, 2020

* Allow BatchEncoding to be initialized empty.

This is required by recent changes introduced in TF 2.2.

* Attempt to unpin Tensorflow to 2.2 with the previous commit.

7d7fe499

- pin TF to 2.1 (#4297) · 30e34386
  Lysandre Debut authored May 11, 2020
```
* pin TF to 2.1

* Pin flake8 as well
```
  30e34386

11 May, 2020 1 commit

[TF 2.2 compat] use tf.VariableAggregation.ONLY_FIRST_REPLICA (#4283) · 94b57bf7

Julien Plu authored May 11, 2020

* Fix the issue to properly run the accumulator with TF 2.2

* Apply style

* Fix training_args_tf for TF 2.2

* Fix the TF training args when only one GPU is available

* Remove the fixed version of TF in setup.py

94b57bf7

07 May, 2020 2 commits
- Pin isort and tf <= 2.1.0 · 2e578243
  Lysandre authored May 07, 2020
  
  2e578243
- Release: v2.9.0 · e7cfc1a3
  Lysandre authored May 07, 2020
  
  e7cfc1a3
05 May, 2020 1 commit

PyTorch 1.5.0 (#3973) · 79b1c696

Lysandre Debut authored May 05, 2020

* Standard deviation can no longer be set to 0

* Remove torch pinned version

* 9th instead of 10th, silly me

79b1c696

01 May, 2020 1 commit
- [testing] add timeout_decorator (#3543) · 18db92dd
  Sam Shleifer authored May 01, 2020
  
  18db92dd
27 Apr, 2020 1 commit
- rm boto3 dependency · 97a37548
  Julien Chaumond authored Apr 25, 2020
  
  97a37548
22 Apr, 2020 1 commit
- Bump tokenizers version to final 0.7.0 (#3898) · 13dd2acc
  Anthony MOI authored Apr 22, 2020
  
  13dd2acc
21 Apr, 2020 1 commit
- [ci] Pin torch version while we update · eb5601b0
  Julien Chaumond authored Apr 21, 2020
  
  eb5601b0