Commits · 08f534d2da47875a4b7eb1c125cfa7f0f3b79642 · chenpangpang / transformers

26 Oct, 2020 1 commit

Sylvain Gugger authored Oct 26, 2020

* Important files

* Styling them all

* Revert "Styling them all"

This reverts commit 7d029395fdae8513b8281cbc2a6c239f8093503e.

* Styling them for realsies

* Fix syntax error

* Fix benchmark_utils

* More fixes

* Fix modeling auto and script

* Remove new line

* Fixes

* More fixes

* Fix more files

* Style

* Add FSMT

* More fixes

* More fixes

* More fixes

* More fixes

* Fixes

* More fixes

* More fixes

* Last fixes

* Make sphinx happy

08f534d2

18 Oct, 2020 1 commit

[Dependencies|tokenizers] Make both SentencePiece and Tokenizers optional dependencies (#7659) · ba8c4d0a

Thomas Wolf authored Oct 18, 2020

* splitting fast and slow tokenizers [WIP]

* [WIP] splitting sentencepiece and tokenizers dependencies

* update dummy objects

* add name_or_path to models and tokenizers

* prefix added to file names

* prefix

* styling + quality

* splitting all the tokenizer files - sorting sentencepiece based ones

* update tokenizer version up to 0.9.0

* remove hard dependency on sentencepiece 🎉

* and removed hard dependency on tokenizers 🎉



* update conversion script

* update missing models

* fixing tests

* move test_tokenization_fast to main tokenization tests - fix bugs

* bump up tokenizers

* fix bert_generation

* update and fix several tokenizers

* keep sentencepiece in deps for now

* fix funnel and deberta tests

* fix fsmt

* fix marian tests

* fix layoutlm

* fix squeezebert and gpt2

* fix T5 tokenization

* fix xlnet tests

* style

* fix mbart

* bump up tokenizers to 0.9.2

* fix model tests

* fix tf models

* fix seq2seq examples

* fix tests without sentencepiece

* fix slow => fast conversion without sentencepiece

* update auto and bert generation tests

* fix mbart tests

* fix auto and common test without tokenizers

* fix tests without tokenizers

* clean up tests, lighten up when tokenizers + sentencepiece are both off

* style quality and tests fixing

* add sentencepiece to doc/examples reqs

* leave sentencepiece on for now

* style quality, split herbert and fix pegasus

* WIP Herbert fast

* add sample_text_no_unicode and fix herbert tokenization

* skip FSMT example test for now

* fix style

* fix fsmt in example tests

* update following Lysandre and Sylvain's comments

* Update src/transformers/testing_utils.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/testing_utils.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/tokenization_utils_base.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/tokenization_utils_base.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

ba8c4d0a
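The optional-dependency idea in this commit can be sketched roughly as below. This is a minimal illustration using only the standard library, not the code from the commit: the helper names `is_package_available` and `require_package` are hypothetical.

```python
import importlib.util


def is_package_available(name: str) -> bool:
    """Return True if `name` can be found on the import path, without importing it."""
    return importlib.util.find_spec(name) is not None


def require_package(name: str) -> None:
    """Raise a readable error when an optional dependency is missing."""
    if not is_package_available(name):
        raise ImportError(
            f"This feature needs the optional package '{name}'. "
            f"Install it with: pip install {name}"
        )


# Hypothetical usage: choose a tokenizer backend based on what is installed.
backend = "fast" if is_package_available("tokenizers") else "slow"
print(backend)
```

Checking availability up front lets the rest of the code fall back to another implementation instead of failing at import time.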

23 Sep, 2020 1 commit

Models doc (#7345) · 3323146e

Sylvain Gugger authored Sep 23, 2020



* Clean up model documentation

* Formatting

* Preparation work

* Long lines

* Main work on rst files

* Cleanup all config files

* Syntax fix

* Clean all tokenizers

* Work on first models

* Models beginning

* FlauBERT

* All PyTorch models

* All models

* Long lines again

* Fixes

* More fixes

* Update docs/source/model_doc/bert.rst
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Update docs/source/model_doc/electra.rst
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Last fixes
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

3323146e

26 Aug, 2020 2 commits

Black 20 release · a75c64d8
Lysandre authored Aug 26, 2020

a75c64d8

Centralize logging (#6434) · 77abd1e7

Lysandre Debut authored Aug 26, 2020



* Logging

* Style

* hf_logging > utils.logging

* Address @thomwolf's comments

* Update test

* Update src/transformers/benchmark/benchmark_utils.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Revert bad change
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

77abd1e7
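One way to picture centralized logging is a single library-level root logger whose level every module logger inherits. The sketch below uses only the standard `logging` module; the library name `mylib` and the helpers `get_logger` and `set_verbosity` are assumptions made for illustration, not the API introduced by this commit.

```python
import logging

_LIBRARY_ROOT = "mylib"  # hypothetical library name


def get_logger(name: str = _LIBRARY_ROOT) -> logging.Logger:
    """Return a logger; modules pass names under the library root, e.g. 'mylib.sub'."""
    return logging.getLogger(name)


def set_verbosity(level: int) -> None:
    """Set the level once on the library root; child loggers inherit it."""
    logging.getLogger(_LIBRARY_ROOT).setLevel(level)


set_verbosity(logging.WARNING)
child = get_logger("mylib.tokenization")
print(child.getEffectiveLevel() == logging.WARNING)
```

Because child loggers leave their own level unset, a single call to `set_verbosity` changes the behavior of every module at once.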

18 Apr, 2020 1 commit

Cleanup fast tokenizers integration (#3706) · 827d6d6e

Thomas Wolf authored Apr 18, 2020



* First pass on utility classes and python tokenizers

* finishing cleanup pass

* style and quality

* Fix tests

* Updating following @mfuntowicz comment

* style and quality

* Fix Roberta

* fix batch_size/seq_length in BatchEncoding

* add alignment methods + tests

* Fix OpenAI and Transfo-XL tokenizers

* adding trim_offsets=True default for GPT2 and RoBERTa

* style and quality

* fix tests

* add_prefix_space in roberta

* bump up tokenizers to rc7

* style

* unfortunately TensorFlow doesn't like these - removing shape/seq_len for now

* Update src/transformers/tokenization_utils.py
Co-Authored-By: Stefan Schweter <stefan@schweter.it>

* Adding doc and docstrings

* making flake8 happy
Co-authored-by: Stefan Schweter <stefan@schweter.it>

827d6d6e

25 Feb, 2020 1 commit

Documentation (#2989) · bb7c4685

Lysandre Debut authored Feb 25, 2020

* All Tokenizers

BertTokenizer + few fixes
RobertaTokenizer
OpenAIGPTTokenizer + Fixes
GPT2Tokenizer + fixes
TransfoXLTokenizer
Correct rst for TransformerXL
XLMTokenizer + fixes
XLNet Tokenizer + Style
DistilBERT + Fix XLNet RST
CTRLTokenizer
CamemBERT Tokenizer
FlaubertTokenizer
XLMRobertaTokenizer
cleanup

* cleanup

bb7c4685

20 Feb, 2020 1 commit
- Add get_vocab method to PretrainedTokenizer · 197d74f9
  Joe Davison authored Feb 20, 2020
  
  197d74f9
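As a rough illustration of what a `get_vocab` method offers, the sketch below defines a toy tokenizer that exposes its token-to-id mapping. `ToyTokenizer` is a hypothetical class written for this example and is not the library's implementation.

```python
class ToyTokenizer:
    """Hypothetical tokenizer, used only to illustrate the idea."""

    def __init__(self, tokens):
        # Assign ids in the order the tokens are given.
        self._token_to_id = {tok: i for i, tok in enumerate(tokens)}

    def get_vocab(self):
        """Return a copy of the token -> id mapping."""
        return dict(self._token_to_id)


tok = ToyTokenizer(["<unk>", "hello", "world"])
vocab = tok.get_vocab()
print(vocab["hello"])
```

Returning a copy keeps callers from mutating the tokenizer's internal state by accident.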
15 Jan, 2020 1 commit
- 💄 super · 83a41d39
  Julien Chaumond authored Jan 15, 2020
  
  83a41d39
06 Jan, 2020 2 commits
- GPU text generation: Moved the encoded_prompt to correct device · 81d6841b
  alberduris authored Dec 31, 2019
  
  81d6841b
- Moved the encoded_prompts to correct device · dd4df80f
  alberduris authored Dec 31, 2019
  
  dd4df80f
22 Dec, 2019 7 commits
- Use built-in open(). · 1c62e87b
  Aymeric Augustin authored Dec 22, 2019
```
On Python 3, `open is io.open`.
```
  1c62e87b
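The identity mentioned in this commit message can be checked directly on any Python 3 interpreter:

```python
import io

# On Python 3 the built-in open and io.open are the same object,
# so there is no need to import io.open for text-mode file handling.
same_object = open is io.open
print(same_object)
```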
- Remove six. · 8af25b16
  Aymeric Augustin authored Dec 22, 2019
  
  8af25b16
- Remove __future__ imports. · c824d15a
  Aymeric Augustin authored Dec 22, 2019
  
  c824d15a
- Move source code inside a src subdirectory. · 6be7cdda
  Aymeric Augustin authored Dec 22, 2019
```
This prevents transformers from being importable simply because the CWD
is the root of the git repository, while not being importable from other
directories. That led to inconsistent behavior, especially in examples.

Once you fetch this commit, in your dev environment, you must run:

    $ pip uninstall transformers
    $ pip install -e .
```
  6be7cdda
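The effect described in this commit message can be reproduced with a small experiment. The sketch below builds two throwaway repositories, one with a flat layout and one with a `src` layout, and asks whether a package would be importable with only the repository root on the import path. The package name `demo_pkg` and both helper functions are made up for this illustration.

```python
import tempfile
from importlib.machinery import PathFinder
from pathlib import Path


def importable_from(root: Path, package: str) -> bool:
    """Would `import package` be resolvable if `root` were the only path entry?"""
    return PathFinder.find_spec(package, [str(root)]) is not None


def make_repo(base: Path, layout: str, package: str) -> Path:
    """Create a tiny fake repository using either a 'flat' or a 'src' layout."""
    repo = base / f"repo_{layout}"
    parent = repo / "src" if layout == "src" else repo
    (parent / package).mkdir(parents=True)
    (parent / package / "__init__.py").write_text("")
    return repo


with tempfile.TemporaryDirectory() as tmp:
    flat_repo = make_repo(Path(tmp), "flat", "demo_pkg")
    src_repo = make_repo(Path(tmp), "src", "demo_pkg")
    flat_ok = importable_from(flat_repo, "demo_pkg")
    src_ok = importable_from(src_repo, "demo_pkg")

print(flat_ok, src_ok)
```

With the `src` layout the package is reachable only after an install such as `pip install -e .`, so behavior no longer depends on the current working directory.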
- Fix E722 flake8 warnings (x26). · 631be270
  Aymeric Augustin authored Dec 21, 2019
  
  631be270
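E722 is the flake8 (pycodestyle) warning for a bare `except:` clause. A typical fix names the exceptions that are actually expected, as in the sketch below; `parse_int` is a hypothetical function and is not taken from the repository.

```python
def parse_int(text, default=0):
    """Parse an integer, falling back to `default` on bad input."""
    try:
        return int(text)
    # A bare `except:` here would trigger E722 and would also swallow
    # KeyboardInterrupt and SystemExit; list the expected errors instead.
    except (TypeError, ValueError):
        return default


print(parse_int("42"), parse_int("not a number"))
```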
- Fix E231 flake8 warning (x9). · ea89bec1
  Aymeric Augustin authored Dec 21, 2019
  
  ea89bec1
- Sort imports with isort. · 158e82e0
  Aymeric Augustin authored Dec 21, 2019
```
This is the result of:

    $ isort --recursive examples templates transformers utils hubconf.py setup.py
```
  158e82e0
21 Dec, 2019 1 commit

Reformat source code with black. · fa84ae26

Aymeric Augustin authored Dec 21, 2019

This is the result of:

    $ black --line-length 119 examples templates transformers utils hubconf.py setup.py

There are a lot of fairly long lines in the project. As a consequence, I'm
picking the longest widely accepted line length, 119 characters.

This is also Thomas' preference, because it allows for explicit variable
names, to make the code easier to understand.

fa84ae26

06 Dec, 2019 1 commit

Remove dependency on pytest for running tests (#2055) · 35401fe5

Aymeric Augustin authored Dec 06, 2019

* Switch to plain unittest for skipping slow tests.

Add a RUN_SLOW environment variable for running them.

* Switch to plain unittest for PyTorch dependency.

* Switch to plain unittest for TensorFlow dependency.

* Avoid leaking open files in the test suite.

This prevents spurious warnings when running tests.

* Fix unicode warning on Python 2 when running tests.

The warning was:

    UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal

* Support running PyTorch tests on a GPU.

Reverts 27e015bd.

* Tests no longer require pytest.

* Make tests pass on cuda

35401fe5
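Skipping slow tests with plain `unittest` and a `RUN_SLOW` environment variable might look like the sketch below. The decorator name `slow` and the set of accepted truthy values are assumptions made for this example; the commit message only states that the variable exists.

```python
import os
import unittest


def slow(test_case):
    """Skip the decorated test unless RUN_SLOW is set to a truthy value."""
    # Assumed convention: "1", "true" or "yes" (case-insensitive) enable slow tests.
    run_slow = os.environ.get("RUN_SLOW", "").lower() in {"1", "true", "yes"}
    return unittest.skipUnless(run_slow, "test is slow; set RUN_SLOW=1 to run")(test_case)


class ExampleTest(unittest.TestCase):
    def test_fast(self):
        self.assertEqual(1 + 1, 2)

    @slow
    def test_slow(self):
        self.assertEqual(sum(range(1000)), 499500)
```

Under these assumptions, running `RUN_SLOW=1 python -m unittest` would include the slow test, while a plain `python -m unittest` would report it as skipped. No pytest plugin or marker is involved.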

05 Dec, 2019 1 commit
- fix #1920 · 8b388827
  thomwolf authored Dec 05, 2019
  
  8b388827
22 Oct, 2019 2 commits
- [CTRL] warn if generation prompt does not start with a control code · ef1b8b2a
  Julien Chaumond authored Oct 22, 2019
```
see also https://github.com/salesforce/ctrl/pull/50
```
  ef1b8b2a
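A check of this kind could be sketched as follows. The set of control codes is a small illustrative subset assumed for the example, and `check_prompt` is a hypothetical helper, not the function added by this commit.

```python
import warnings

# Illustrative subset, assumed for this sketch; the model defines its own list.
CONTROL_CODES = {"Books", "Links", "Reviews", "Wikipedia"}


def check_prompt(prompt: str) -> bool:
    """Warn and return False if the first word of the prompt is not a control code."""
    words = prompt.split()
    first_word = words[0] if words else ""
    if first_word not in CONTROL_CODES:
        warnings.warn("Prompt does not start with a control code; output may be poor.")
        return False
    return True


print(check_prompt("Books Once upon a time"))
```

Warning instead of raising keeps generation usable for prompts without a control code while still flagging the likely quality drop.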
- Fix #1597 · 777faa8a
  Lysandre authored Oct 22, 2019
  
  777faa8a
10 Oct, 2019 2 commits
- move back to simple space splitting · 177a7212
  thomwolf authored Oct 10, 2019
  
  177a7212
- switching to moses tokenizer · 43a237f1
  thomwolf authored Oct 10, 2019
  
  43a237f1
09 Oct, 2019 1 commit
- Temporary CTRL tokenizer fix · 036483fa
  LysandreJik authored Oct 09, 2019
  
  036483fa
08 Oct, 2019 2 commits
- fix tokenization · 24831477
  thomwolf authored Oct 08, 2019
  
  24831477
- update tokenizer · 03c2c762
  thomwolf authored Oct 08, 2019
  
  03c2c762
04 Oct, 2019 1 commit

Adding CTRL (squashed commit) · dbed1c5d

keskarnitish authored Sep 30, 2019

adding conversion script

adding first draft of modeling & tokenization

adding placeholder for test files

bunch of changes

registering the tokenizer/model/etc

tests

change link; something is very VERY wrong here

weird end-of-word thingy going on

i think the tokenization works now ; wrote the unit tests

overall structure works;load w next

the monster is alive!

works after some cleanup as well

adding emacs autosave to gitignore

currently only supporting the 48 layer one; seems to infer fine on my macbook

cleanup

fixing some documentation

fixing some documentation

tests passing?

now works on CUDA also

adding greedy?

adding greedy sampling

works well

dbed1c5d

03 Oct, 2019 1 commit
- update links to new weights · 6be46a6e
  VictorSanh authored Oct 03, 2019
  
  6be46a6e
26 Sep, 2019 3 commits
- Update RoBERTa and GPT-2 Tokenizer documentation (fix #1343) · ecfddc60
  LysandreJik authored Sep 26, 2019
  
  ecfddc60
- [BIG] pytorch-transformers => transformers · 31c23bd5
  thomwolf authored Sep 26, 2019
  
  31c23bd5
- fix #1196 and fix #1285 · 7a99e4b1
  thomwolf authored Sep 26, 2019
  
  7a99e4b1
30 Aug, 2019 3 commits
- clean up all byte-level bpe tests · 5dd7b677
  thomwolf authored Aug 30, 2019
  
  5dd7b677
- update GPT2 docstring · fd10d79b
  thomwolf authored Aug 30, 2019
  
  fd10d79b
- Fix GPT2 and RoBERTa tokenizer to begin with a space - update Roberta tokenizer · 0517e7a1
  thomwolf authored Aug 30, 2019
  
  0517e7a1
23 Aug, 2019 1 commit
- max_len_single_sentence & max_len_sentences_pair as attributes so they can be modified · 3bcbebd4
  thomwolf authored Aug 23, 2019
  
  3bcbebd4
21 Aug, 2019 2 commits
- Add max length · fdc487d8
  thomwolf authored Aug 21, 2019
  
  fdc487d8
- adding gpt-2 large · aa05dc89
  thomwolf authored Aug 21, 2019
  
  aa05dc89
04 Aug, 2019 1 commit
- big doc update [WIP] · 009273db
  thomwolf authored Aug 04, 2019
  
  009273db