1. 18 Oct, 2020 1 commit
    • [Dependencies|tokenizers] Make both SentencePiece and Tokenizers optional dependencies (#7659) · ba8c4d0a
      Thomas Wolf authored
      * splitting fast and slow tokenizers [WIP]
      
      * [WIP] splitting sentencepiece and tokenizers dependencies
      
      * update dummy objects
      
      * add name_or_path to models and tokenizers
      
      * prefix added to file names
      
      * prefix
      
      * styling + quality
      
      * splitting all the tokenizer files - sorting sentencepiece-based ones
      
      * update tokenizers version up to 0.9.0
      
      * remove hard dependency on sentencepiece 🎉
      
      * and removed hard dependency on tokenizers 🎉
      
      * update conversion script
      
      * update missing models
      
      * fixing tests
      
      * move test_tokenization_fast to main tokenization tests - fix bugs
      
      * bump up tokenizers
      
      * fix bert_generation
      
      * update and fix several tokenizers
      
      * keep sentencepiece in deps for now
      
      * fix funnel and deberta tests
      
      * fix fsmt
      
      * fix marian tests
      
      * fix layoutlm
      
      * fix squeezebert and gpt2
      
      * fix T5 tokenization
      
      * fix xlnet tests
      
      * style
      
      * fix mbart
      
      * bump up tokenizers to 0.9.2
      
      * fix model tests
      
      * fix tf models
      
      * fix seq2seq examples
      
      * fix tests without sentencepiece
      
      * fix slow => fast conversion without sentencepiece
      
      * update auto and bert generation tests
      
      * fix mbart tests
      
      * fix auto and common test without tokenizers
      
      * fix tests without tokenizers
      
      * clean up and lighten tests when tokenizers + sentencepiece are both off
      
      * style, quality, and test fixes
      
      * add sentencepiece to doc/examples reqs
      
      * leave sentencepiece on for now
      
      * style, quality, split HerBERT, and fix Pegasus
      
      * WIP HerBERT fast
      
      * add sample_text_no_unicode and fix HerBERT tokenization
      
      * skip FSMT example test for now
      
      * fix style
      
      * fix fsmt in example tests
      
      * update following Lysandre and Sylvain's comments
      
      * Update src/transformers/testing_utils.py
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
      
      * Update src/transformers/testing_utils.py
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
      
      * Update src/transformers/tokenization_utils_base.py
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
      
      * Update src/transformers/tokenization_utils_base.py
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
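      With both libraries optional after this change, downstream code can probe for them before asking for a SentencePiece-based or Rust-backed tokenizer. Below is a minimal sketch of such a guard; the helpers is_sentencepiece_available/is_tokenizers_available and the file_utils import path are assumptions based on transformers releases from around this time, not something this commit guarantees:
      
          # Sketch: choose a tokenizer implementation based on which optional
          # dependencies are installed. The import path below is an assumption
          # for transformers ~3.4-era releases.
          from transformers import AutoTokenizer
          from transformers.file_utils import (
              is_sentencepiece_available,
              is_tokenizers_available,
          )
          
          def load_tokenizer(name: str):
              # Prefer the Rust-backed fast tokenizer when `tokenizers` is
              # installed; otherwise fall back to the pure-Python implementation.
              return AutoTokenizer.from_pretrained(name, use_fast=is_tokenizers_available())
          
          if is_sentencepiece_available():
              # SentencePiece-backed slow tokenizers (T5, XLNet, ALBERT, ...)
              # only load when the `sentencepiece` package is present.
              tok = load_tokenizer("t5-small")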
  2. 24 Sep, 2020 1 commit
  3. 04 Sep, 2020 1 commit
  4. 24 Jul, 2020 1 commit
  5. 26 Jun, 2020 1 commit
  6. 18 Apr, 2020 1 commit
    • Cleanup fast tokenizers integration (#3706) · 827d6d6e
      Thomas Wolf authored
      
      * First pass on utility classes and Python tokenizers
      
      * finishing cleanup pass
      
      * style and quality
      
      * Fix tests
      
      * Updating following @mfuntowicz's comment
      
      * style and quality
      
      * Fix Roberta
      
      * fix batch_size/seq_length in BatchEncoding
      
      * add alignment methods + tests
      
      * Fix OpenAI and Transfo-XL tokenizers
      
      * adding trim_offsets=True default for GPT2 and RoBERTa
      
      * style and quality
      
      * fix tests
      
      * add_prefix_space in roberta
      
      * bump up tokenizers to rc7
      
      * style
      
      * unfortunately TensorFlow does not like these - removing shape/seq_len for now
      
      * Update src/transformers/tokenization_utils.py
      Co-Authored-By: Stefan Schweter <stefan@schweter.it>
      
      * Adding doc and docstrings
      
      * making flake8 happy
      Co-authored-by: Stefan Schweter <stefan@schweter.it>
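      The alignment methods added here are what expose token-to-character mappings through the high-level API. A short illustrative use, assuming a release recent enough to support calling the tokenizer directly (the model name is only an example):
      
          # Sketch: offset mappings from a fast (Rust-backed) GPT-2 tokenizer.
          # Requires the `tokenizers` package.
          from transformers import GPT2TokenizerFast
          
          # add_prefix_space=True mirrors the GPT2/RoBERTa handling mentioned above.
          tok = GPT2TokenizerFast.from_pretrained("gpt2", add_prefix_space=True)
          
          enc = tok("Hello world", return_offsets_mapping=True)
          # Each token comes with its (start, end) character span in the
          # original string - the raw material for alignment.
          print(list(zip(enc.tokens(), enc["offset_mapping"])))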
  7. 15 Jan, 2020 1 commit
  8. 06 Jan, 2020 2 commits
  9. 05 Jan, 2020 1 commit
  10. 22 Dec, 2019 5 commits
  11. 21 Dec, 2019 1 commit
    • Reformat source code with black. · fa84ae26
      Aymeric Augustin authored
      This is the result of:
      
          $ black --line-length 119 examples templates transformers utils hubconf.py setup.py
      
      There are a lot of fairly long lines in the project. As a consequence, I'm
      picking the longest widely accepted line length, 119 characters.
      
      This is also Thomas' preference, because it allows for explicit variable
      names, which make the code easier to understand.
  12. 18 Dec, 2019 1 commit
  13. 11 Nov, 2019 1 commit
  14. 30 Oct, 2019 1 commit