Commits · 3c682ea15cf50636360545ba88a325868d194b0d · chenpangpang / transformers

23 Oct, 2020 3 commits

[Examples] Allow EncoderDecoderModels to be trained with Seq2Seq (#7809) · 3c682ea1

Patrick von Platen authored Oct 23, 2020

* Make Seq2Seq Trainer more similar to Trainer

* fix typo

* fix seq2seq trainer

* remove from tests

* remove lock

* remove train files

* delete test files

* correct typo

* check at init

* make sure trainer is not slowed down on TPU

* correct isort

* remove use cache

* fix use cache

* add last use cache = false

3c682ea1

Handling longformer model_type (#7990) · d39da5a2

Ethan Perez authored Oct 23, 2020

Updating the run_squad training script to handle the "longformer" `model_type`. The Longformer is trained in the same way as RoBERTa, so I've added the "longformer" `model_type` (that's the right huggingface name for the Longformer model, right?) everywhere there was a "roberta" `model_type` reference. The Longformer (like RoBERTa) doesn't use `token_type_ids` (as I understand from looking at the [longformer notebook](https://github.com/patil-suraj/Notebooks/blob/master/longformer_qa_training.ipynb)), which is what gets updated after this change.

This fix might be related to [this issue](https://github.com/huggingface/transformers/issues/7249) with SQuAD training when using run_squad.py
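
The change described above can be sketched roughly as follows. This is only an illustration, assuming the script builds a per-batch inputs dictionary and drops `token_type_ids` for model types that do not use them; the set name `MODEL_TYPES_WITHOUT_TOKEN_TYPE_IDS` and the helper `build_inputs` are hypothetical and are not taken from run_squad.py.

```python
# Hypothetical set of model types that do not take token_type_ids.
# Per the commit message, "longformer" is added wherever "roberta" appeared.
MODEL_TYPES_WITHOUT_TOKEN_TYPE_IDS = {"roberta", "longformer"}


def build_inputs(model_type, batch):
    """Return model inputs, dropping token_type_ids when the model type does not use them."""
    inputs = dict(batch)  # shallow copy so the caller's batch is left untouched
    if model_type in MODEL_TYPES_WITHOUT_TOKEN_TYPE_IDS:
        inputs.pop("token_type_ids", None)
    return inputs


# Toy batch with plain lists standing in for tensors.
batch = {
    "input_ids": [0, 10, 2],
    "attention_mask": [1, 1, 1],
    "token_type_ids": [0, 0, 0],
}
longformer_inputs = build_inputs("longformer", batch)
bert_inputs = build_inputs("bert", batch)
```

Keeping the model types in one set means a new type only has to be listed once, rather than at every place the check is made.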

d39da5a2

Handle the case when title is None (#7941) · 88b3a91e
Lalit Pagaria authored Oct 23, 2020

88b3a91e

22 Oct, 2020 3 commits

[s2s trainer] tests to use distributed on multi-gpu machine (#7965) · 023f0f37
Stas Bekman authored Oct 22, 2020

023f0f37

New run glue script (#7917) · 2e5052d4

Sylvain Gugger authored Oct 22, 2020



* Start simplification

* More progress

* Finished script

* Address comments and update tests instructions

* Wrong test

* Accept files as inputs and fix test

* Update src/transformers/trainer_utils.py
Co-authored-by: Julien Chaumond <chaumond@gmail.com>

* Fix labels and add combined score

* Add special labels

* Update TPU command

* Revert to old label strategy

* Use model labels

* Fix for STS-B

* Styling

* Apply suggestions from code review
Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>

* Code styling

* Fix review comments
Co-authored-by: Julien Chaumond <chaumond@gmail.com>
Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>

2e5052d4

Add whole word mask support for lm fine-tune (#7925) · a16e568f

wlhgtc authored Oct 22, 2020



* ADD: add whole word mask proxy for both eng and chinese

* MOD: adjust format

* MOD: reformat code

* MOD: update import

* MOD: fix bug

* MOD: add import

* MOD: fix bug

* MOD: decouple code and update readme

* MOD: reformat code

* Update examples/language-modeling/README.md
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update examples/language-modeling/README.md
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update examples/language-modeling/run_language_modeling.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update examples/language-modeling/run_language_modeling.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update examples/language-modeling/run_language_modeling.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update examples/language-modeling/run_language_modeling.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* change wwm to whole_word_mask

* reformat code

* reformat

* format

* Code quality

* ADD: update chinese ref readme

* MOD: small changes

* MOD: small changes2

* update readme
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>

a16e568f

21 Oct, 2020 1 commit
- [seq2seq testing] multigpu test run via subprocess (#7281) · 8b381733
  Stas Bekman authored Oct 21, 2020
```
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
```
  8b381733
20 Oct, 2020 2 commits

[s2s] create doc for pegasus/fsmt replication (#7934) · 0e24e4c1
Stas Bekman authored Oct 20, 2020

0e24e4c1

[testing] rename skip targets + docs (#7863) · 3e31e7f9

Stas Bekman authored Oct 20, 2020



* rename skip targets + docs

* fix quotes

* style

* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* small improvements

* fix
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

3e31e7f9

19 Oct, 2020 1 commit

Allow Custom Dataset in RAG Retriever (#7763) · 033f29c6

Quentin Lhoest authored Oct 19, 2020

* add CustomHFIndex

* typo in config

* update tests

* add custom dataset example

* clean script

* update test data

* minor in test

* docs

* docs

* style

* fix imports

* allow to pass the indexed dataset directly

* update tests

* use multiset DPR

* address thom and patrick's comments

* style

* update dpr tokenizer

* add output_dir flag in use_own_knowledge_dataset.py

* allow custom datasets in examples/rag/finetune.py

* add test for custom dataset in distributed rag retriever

033f29c6

18 Oct, 2020 1 commit

[Dependencies|tokenizers] Make both SentencePiece and Tokenizers optional dependencies (#7659) · ba8c4d0a

Thomas Wolf authored Oct 18, 2020

* splitting fast and slow tokenizers [WIP]

* [WIP] splitting sentencepiece and tokenizers dependencies

* update dummy objects

* add name_or_path to models and tokenizers

* prefix added to file names

* prefix

* styling + quality

* splitting all the tokenizer files - sorting sentencepiece based ones

* update tokenizer version up to 0.9.0

* remove hard dependency on sentencepiece 🎉

* and removed hard dependency on tokenizers 🎉



* update conversion script

* update missing models

* fixing tests

* move test_tokenization_fast to main tokenization tests - fix bugs

* bump up tokenizers

* fix bert_generation

* update and fix several tokenizers

* keep sentencepiece in deps for now

* fix funnel and deberta tests

* fix fsmt

* fix marian tests

* fix layoutlm

* fix squeezebert and gpt2

* fix T5 tokenization

* fix xlnet tests

* style

* fix mbart

* bump up tokenizers to 0.9.2

* fix model tests

* fix tf models

* fix seq2seq examples

* fix tests without sentencepiece

* fix slow => fast  conversion without sentencepiece

* update auto and bert generation tests

* fix mbart tests

* fix auto and common test without tokenizers

* fix tests without tokenizers

* clean up tests lighten up when tokenizers + sentencepiece are both off

* style quality and tests fixing

* add sentencepiece to doc/examples reqs

* leave sentencepiece on for now

* style quality split herbert and fix pegasus

* WIP Herbert fast

* add sample_text_no_unicode and fix herbert tokenization

* skip FSMT example test for now

* fix style

* fix fsmt in example tests

* update following Lysandre and Sylvain's comments

* Update src/transformers/testing_utils.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/testing_utils.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/tokenization_utils_base.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/tokenization_utils_base.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

ba8c4d0a

17 Oct, 2020 1 commit
- [s2s testing] turn all to unittests, use auto-delete temp dirs (#7859) · 9f7b2b24
  Stas Bekman authored Oct 17, 2020
  
  9f7b2b24
16 Oct, 2020 5 commits
- [seq2seq testing] improve readability (#7845) · 1652ddad
  Stas Bekman authored Oct 16, 2020
  
  1652ddad
- Fix missing reference titles in retrieval evaluation of RAG (#7817) · 466115b2
  Quentin Lhoest authored Oct 16, 2020
  
  466115b2
- [testing] disable FutureWarning in examples tests (#7842) · 464b53f5
  Stas Bekman authored Oct 16, 2020
```
* [testing] disable FutureWarning in examples tests

same as tests/conftest.py, we can't resolve those warnings, so turn the noise off.

* fix
```
  464b53f5
- [cleanup] assign todos, faster bart-cnn test (#7835) · 96e47d92
  Sam Shleifer authored Oct 16, 2020
```
* 2 beam output

* unassign/remove TODOs

* remove one more
```
  96e47d92
- [seq2seq] get_git_info fails gracefully (#7843) · 2255c2c7
  Stas Bekman authored Oct 15, 2020
```
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
```
  2255c2c7
15 Oct, 2020 1 commit
- Set XLA example time to 500s · 2485b8b0
  Lysandre authored Oct 15, 2020
  
  2485b8b0
14 Oct, 2020 3 commits

Don't use `store_xxx` on optional bools (#7786) · bb9559a7
Sylvain Gugger authored Oct 14, 2020
```
* Don't use `store_xxx` on optional bools

* Refine test

* Refine test
```
bb9559a7
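
One way to read the commit title above: argparse's `store_true`/`store_false` actions always produce `True` or `False`, so they cannot represent an optional bool whose default is `None`. The sketch below shows an alternative using a string-to-bool converter. It is an assumption about the general technique, not the code from this commit; `string_to_bool` and the `--flag` option are hypothetical names.

```python
import argparse


def string_to_bool(value):
    """Parse common true/false spellings (hypothetical helper for this sketch)."""
    lowered = value.lower()
    if lowered in {"true", "1", "yes"}:
        return True
    if lowered in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean value, got {value!r}")


parser = argparse.ArgumentParser()
# --flag is a hypothetical option; default=None means "not specified",
# a state that a store_true/store_false action cannot express.
parser.add_argument("--flag", type=string_to_bool, default=None)

unset = parser.parse_args([]).flag
enabled = parser.parse_args(["--flag", "true"]).flag
disabled = parser.parse_args(["--flag", "false"]).flag
```

With this setup the caller can distinguish three states: the option omitted, explicitly enabled, or explicitly disabled.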

Add predict step accumulation (#7767) · a1d1b332

Sylvain Gugger authored Oct 14, 2020



* Add eval_accumulation_step and clean distributed eval

* Add TPU test

* Add TPU stuff

* Fix arg name

* Fix Seq2SeqTrainer

* Fix total_size

* Update src/transformers/trainer_pt_utils.py
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Doc and add test to TPU

* Add unit test

* Adapt name
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

a1d1b332

fix examples/rag imports, tests (#7712) · 8feb0cc9
Sam Shleifer authored Oct 14, 2020

8feb0cc9

13 Oct, 2020 1 commit
- fixed lots of typos. (#7758) · 7e73c128
  Tiger authored Oct 13, 2020
  
  7e73c128
12 Oct, 2020 4 commits
- [marian] Automate Tatoeba-Challenge conversion (#7709) · 9c2b2db2
  Sam Shleifer authored Oct 12, 2020
  
  9c2b2db2
- Fix tf text class (#7724) · d9ffb87e
  Julien Plu authored Oct 12, 2020
```
* Fix test

* fix generic text classification

* fix test

* Fix tests
```
  d9ffb87e
- Fix code quality · d6175a42
  sgugger authored Oct 12, 2020
  
  d6175a42
- The input training data files (multiple files in glob format). (#7717) · f176e707
  Kelvin authored Oct 12, 2020
```
Splitting large files into smaller ones can often prevent the tokenizer from running out of memory in environments such as Colab that have no swap memory.
```
  f176e707
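
A minimal sketch of accepting training files in glob format, as the commit title above describes. The helper name `resolve_train_files` and the shard file names are hypothetical; the actual script's argument names are not shown in this log.

```python
import glob
import os
import tempfile


def resolve_train_files(pattern):
    """Expand a glob pattern into a sorted list of training file paths."""
    return sorted(glob.glob(pattern))


# Demonstrate with throwaway shard files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    for index in range(3):
        shard_path = os.path.join(tmp_dir, f"train_{index}.txt")
        with open(shard_path, "w") as handle:
            handle.write(f"shard {index}\n")

    matched = resolve_train_files(os.path.join(tmp_dir, "train_*.txt"))
    shard_names = [os.path.basename(path) for path in matched]
```

Sorting the matches keeps the file order deterministic, since `glob.glob` does not guarantee any particular ordering.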
11 Oct, 2020 1 commit
- [examples] bump pl=0.9.0 (#7053) · 827c5194
  Sam Shleifer authored Oct 11, 2020
  
  827c5194
09 Oct, 2020 2 commits
- Fix dataset cardinality (#7678) · 9ad83059
  Julien Plu authored Oct 09, 2020
```
* Fix test

* Fix cardinality issue

* Fix test
```
  9ad83059
- [s2s] Switch README urls to cdn (#7670) · 297233fa
  Sam Shleifer authored Oct 08, 2020
  
  297233fa
08 Oct, 2020 3 commits
- [pseudo] Switch URLS to CDN (#7661) · a1ecc90d
  Sam Shleifer authored Oct 08, 2020
  
  a1ecc90d
- [s2s] configure lr_scheduler from command line (#7641) · 06a973fd
  Suraj Patil authored Oct 08, 2020
  
  06a973fd
- [pseudolabels] cleanup markdown table (#7653) · aba4e229
  Sam Shleifer authored Oct 07, 2020
  
  aba4e229
07 Oct, 2020 2 commits

[s2s] release pseudolabel links and instructions (#7639) · e2bb9abb
Sam Shleifer authored Oct 07, 2020

e2bb9abb

Trainer callbacks (#7596) · 08ba4b49

Sylvain Gugger authored Oct 07, 2020



* Initial callback proposal

* Finish various callbacks

* Post-rebase conflicts

* Fix tests

* Don't use something that's not set

* Documentation

* Remove unwanted print.

* Document all models can work

* Add tests + small fixes

* Update docs/source/internal/trainer_utils.rst
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Address review comments

* Fix TF tests

* Real fix this time

* This one should work

* Fix typo

* Really fix typo
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

08ba4b49

06 Oct, 2020 2 commits
- [s2s] save first batch to json for debugging purposes (#6810) · 500be01c
  Sam Shleifer authored Oct 06, 2020
  
  500be01c
- Support T5 Distillation w/hidden state supervision (#7599) · d5d2744a
  Sam Shleifer authored Oct 05, 2020
  
  d5d2744a
04 Oct, 2020 1 commit
- [s2s] add config params like Dropout in Seq2SeqTrainingArguments (#7532) · 99cb924b
  Suraj Patil authored Oct 04, 2020
  
  99cb924b
02 Oct, 2020 1 commit
- [s2s] fix lockfile and peg distillation constants (#7545) · 9bdce3a4
  Sam Shleifer authored Oct 02, 2020
  
  9bdce3a4
01 Oct, 2020 2 commits
- [s2s] Adafactor support for builtin trainer (#7522) · de4d7b00
  Sam Shleifer authored Oct 01, 2020
  
  de4d7b00
- [s2s] trainer scripts: Remove --run_name, thanks sylvain! (#7521) · d3a9601a
  Sam Shleifer authored Oct 01, 2020
  
  d3a9601a