- 18 May, 2021 1 commit
Sylvain Gugger authored
- 13 May, 2021 1 commit
Volodymyr Byno authored
- 11 May, 2021 2 commits
Sylvain Gugger authored
* Add test and see where CI is unhappy
* Load with strict=False
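For context, PyTorch's `strict=False` loading tolerates key mismatches instead of raising; a minimal sketch (the toy model and missing key are illustrative, not the Trainer code):

```python
import torch
from torch import nn

model = nn.Linear(4, 2)
state_dict = {"weight": torch.zeros(2, 4)}  # "bias" intentionally missing

# strict=False records missing/unexpected keys instead of raising.
result = model.load_state_dict(state_dict, strict=False)
print(result.missing_keys)     # ['bias']
print(result.unexpected_keys)  # []
```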
Sylvain Gugger authored
* Autogenerate model cards from the Trainer
* ModelCard deprecated
* Fix test
* Style
* Apply suggestions from code review
* Address review comments
* Quality
* With all metadata
* Metadata
* Post-merge conflict mess
* Data args and all examples
* Default license and languages when possible

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
- 10 May, 2021 1 commit
Sylvain Gugger authored
- 06 May, 2021 1 commit
Sylvain Gugger authored
* Fix RNG saves in distributed mode.
* Update src/transformers/trainer.py

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
- 04 May, 2021 1 commit
Sylvain Gugger authored
* Set generator in dataloader
* Use generator in all random samplers
* Checkpoint all RNG states
* Final version
* Quality
* Test
* Address review comments
* Quality
* Remove debug util
* Add python and numpy RNGs
* Split states in different files in distributed
* Quality
* local_rank for TPUs
* Only use generator when accepted
* Add test
* Set seed to avoid flakiness
* Make test less flaky
* Quality
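A minimal sketch of the RNG-checkpointing idea from this entry (the function names and file layout are illustrative, not the exact Trainer format):

```python
import random

import numpy as np
import torch

def save_rng_states(path):
    # Capture the Python, NumPy, and PyTorch RNG states together so a
    # resumed run continues the exact random sequence it left off at.
    states = {
        "python": random.getstate(),
        "numpy": np.random.get_state(),
        "torch": torch.get_rng_state(),
    }
    torch.save(states, path)

def load_rng_states(path):
    states = torch.load(path)
    random.setstate(states["python"])
    np.random.set_state(states["numpy"])
    torch.set_rng_state(states["torch"])
```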
- 03 May, 2021 1 commit
Sylvain Gugger authored
- 30 Apr, 2021 1 commit
Stas Bekman authored
* sync
* add activation overflow debug utility
* cleanup
* document detect_overflow
* import torch
* add deprecation warning
* Apply suggestions from code review
* convert to rst, add note
* add class
* fix docs
* improve the doc
* rework to dump a lot more info about each frame
* complete expansion
* cleanup
* format
* cleanup
* doesn't have to be transformers
* Apply suggestions from code review
* wrap long line
* style

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
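The utility added by this entry can be attached to a model roughly like this (a sketch; see transformers.debug_utils for the exact API, and the model name is illustrative):

```python
from transformers import AutoModel
from transformers.debug_utils import DebugUnderflowOverflow

model = AutoModel.from_pretrained("bert-base-uncased")

# Registers forward hooks on every submodule and dumps detailed frame
# info as soon as an inf/nan shows up in activations or weights.
debug_overflow = DebugUnderflowOverflow(model)

# ... then train or run inference as usual; the hooks report on overflow.
```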
- 26 Apr, 2021 4 commits
Sylvain Gugger authored
* Pass along seed to DistributedSampler
* Add seed to DistributedLengthGroupedSampler
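For reference, the seed being passed along mirrors PyTorch's DistributedSampler argument, and every process must use the same value (num_replicas/rank are hardcoded here only to make the sketch runnable outside a distributed launch):

```python
import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(100))

# All processes must share the same seed so they shuffle identically
# and each ends up with a disjoint shard of the same permutation.
sampler = DistributedSampler(dataset, num_replicas=2, rank=0, seed=42)
```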
LSinev authored
Sylvain Gugger authored
* Add FP16 support for SageMaker MP
* Add print debugs
* Squeeze
* Remove debug statements
* Add defensive check
* Typo
Patrick von Platen authored
- 23 Apr, 2021 2 commits
Sylvain Gugger authored
* Initial support for upload to hub
* push -> upload
* Fixes + examples
* Fix torchhub test
* Torchhub test I hate you
* push_model_to_hub -> push_to_hub
* Apply mixin to other pretrained models
* Remove ABC inheritance
* Add tests
* Typo
* Run tests
* Install git-lfs
* Change approach
* Add push_to_hub to all
* Staging test suite
* Typo
* Maybe like this?
* More deps
* Cache
* Adapt name
* Quality
* MOAR tests
* Put it in testing_utils
* Docs + torchhub last hope
* Styling
* Wrong method
* Typos
* Update src/transformers/file_utils.py
* Address review comments
* Apply suggestions from code review

Co-authored-by: Julien Chaumond <julien@huggingface.co>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
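With the mixin from this entry, uploading looks roughly like this (the repo name is illustrative; it assumes you are logged in to the Hub and have git-lfs installed):

```python
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Creates (or reuses) a repo under your namespace and uploads the files.
model.push_to_hub("my-finetuned-bert")
tokenizer.push_to_hub("my-finetuned-bert")
```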
Teven authored
* Fixed trainer total_flos reloading in distributed mode
* Logging flos at the end of training
- 22 Apr, 2021 1 commit
Sylvain Gugger authored
* Fix Trainer with remove_unused_columns=False
* Typo
- 21 Apr, 2021 1 commit
Stas Bekman authored
This PR fixes a bug that was most likely exposed (not caused) by https://github.com/huggingface/transformers/pull/11318; surprisingly, the same test passed just fine before that PR.
- 20 Apr, 2021 2 commits
Sylvain Gugger authored
* Update to use the datasets remove_columns method
* Quality
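For reference, the datasets method in question (the column names are illustrative):

```python
from datasets import Dataset

ds = Dataset.from_dict({"text": ["a", "b"], "label": [0, 1], "idx": [0, 1]})

# Drops a column the model's forward() does not accept.
ds = ds.remove_columns(["idx"])
print(ds.column_names)  # ['text', 'label']
```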
Sylvain Gugger authored
- 19 Apr, 2021 2 commits
Sylvain Gugger authored
Stas Bekman authored
* fix the placement on device with fp16_full_eval
* deepspeed never goes on device
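fp16_full_eval is the flag that exercises this placement path (a sketch; output_dir is illustrative):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    # Run evaluation entirely in fp16: the model is cast and moved to the
    # device for eval, except under DeepSpeed, which manages placement itself.
    fp16_full_eval=True,
)
```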
- 16 Apr, 2021 1 commit
Sylvain Gugger authored
* Bulk of the work
* Polish and tests
* Update QA Trainer
* Avoid breaking the predict method
* Deprecation warnings
* Store real eval dataloader
* Get eval dataset reference before wrap
- 15 Apr, 2021 1 commit
Sylvain Gugger authored
- 14 Apr, 2021 1 commit
Sylvain Gugger authored
* IterableDatasetShard
* Test and integration in Trainer
* Update src/transformers/trainer_pt_utils.py
* Style

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
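The sharding idea behind this entry, in a simplified sketch (the real IterableDatasetShard in trainer_pt_utils also handles batch-level sharding and epoch seeding; the class here is illustrative):

```python
from torch.utils.data import IterableDataset

class SimpleIterableShard(IterableDataset):
    """Yield only the elements that belong to this process's shard."""

    def __init__(self, dataset, num_processes: int, process_index: int):
        self.dataset = dataset
        self.num_processes = num_processes
        self.process_index = process_index

    def __iter__(self):
        for i, element in enumerate(self.dataset):
            # Round-robin assignment: process p keeps the elements whose
            # index satisfies i % num_processes == p.
            if i % self.num_processes == self.process_index:
                yield element
```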
- 08 Apr, 2021 4 commits
Stas Bekman authored
* make fairscale and deepspeed setup extras
* fix default
* Apply suggestions from code review
* no reason not to ask for the good version
* update the CIs

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Stas Bekman authored
* solve "scheduler before optimizer step" warning
* style
* correct the state evaluation test
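The warning in question comes from stepping the LR scheduler before the optimizer; the fixed ordering looks like this (toy model and schedule are illustrative):

```python
import torch
from torch import nn

model = nn.Linear(4, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lambda step: 0.95**step)

for _ in range(3):
    loss = model(torch.randn(8, 4)).sum()
    loss.backward()
    optimizer.step()   # step the optimizer first ...
    scheduler.step()   # ... then advance the learning-rate schedule
    optimizer.zero_grad()
```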
Stas Bekman authored
* synced gpus
* fix
* fix
* need to use t5-small for quality tests
* notes
* complete merge
* fix a disappearing std stream problem
* start zero3 tests
* wip
* tune params
* sorting out the pre-trained model loading
* reworking generate loop wip
* wip
* style
* fix tests
* split the tests
* refactor tests
* wip
* parameterized
* fix
* work out the resume from non-ds checkpoint pass + test
* cleanup
* remove no longer needed code
* split getter/setter functions
* complete the docs
* suggestions
* gpus and their compute capabilities link
* Apply suggestions from code review
* style
* remove invalid paramgd
* automatically configure zero3 params that rely on hidden size
* make _get_resized_embeddings zero3-aware
* add test exercising resize_token_embeddings()
* add docstring

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
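resize_token_embeddings, which this entry makes ZeRO-3 aware, is typically called like this (the model and added token are illustrative):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
tokenizer.add_tokens(["<new_token>"])

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
# Under ZeRO-3 the embedding matrix is partitioned across GPUs, so
# _get_resized_embeddings must gather it before building the new one.
model.resize_token_embeddings(len(tokenizer))
```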
Jannis Born authored
* fix: docstrings in prediction_step
* ci: Satisfy line length requirements
* ci: character length requirements
- 31 Mar, 2021 2 commits
Sylvain Gugger authored
* Replace is_sagemaker_distributed_available
* Merge SageMakerTrainer into Trainer
* Test with shorter condition
* Put back deleted line
* Deprecate SageMakerTrainer and SageMakerTrainingArguments
* Apply suggestions from code review

Co-authored-by: Philipp Schmid <32632186+philschmid@users.noreply.github.com>
Sylvain Gugger authored
* First third
* Styling and fix mistake
* Quality
* All the rest
* Treat %s and %d
* typo
* Missing )
* Apply suggestions from code review

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
- 29 Mar, 2021 1 commit
pcuenca authored
A new argument `length_column_name` has been added to `TrainingArguments`, with default value `"length"`. If this column exists and `group_by_length` is `True`, the train sampler will use it for grouping rather than computing it before training starts. This is an optimization that allows the user to prepare data for fast processing, preventing sequential access to the dataset as described in issue #10909.
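Opting in looks like this (output_dir is illustrative, and the dataset is assumed to carry a precomputed "length" column):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    group_by_length=True,
    # Use the precomputed "length" column instead of measuring every
    # example before training starts.
    length_column_name="length",
)
```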
- 24 Mar, 2021 1 commit
imzhengzx authored
The original code on line 246 is

```python
tokenizer: Optional["PreTrainedTokenizerBase"] = None,
```

It should be

```python
tokenizer: Optional[PreTrainedTokenizerBase] = None,
```
- 23 Mar, 2021 1 commit
Bhadresh Savani authored
- 22 Mar, 2021 2 commits
Ruan Chaves authored
* Modify the _hp_search_setup method on the Trainer class to handle the wandb argument passed by Ray Tune to model config.
* Reformat single quotes as double quotes.
Sidd Karamcheti authored
Add a simple one-character fix so that on_step_begin and on_step_end are called at the right times (#10839)
- 18 Mar, 2021 1 commit
Sylvain Gugger authored
* Fix distributed evaluation
* Use logger
- 17 Mar, 2021 3 commits
Mansi Mane authored
* Added debug prints and config
* Added extra samples to SequentialDistributedSampler; updated the SequentialDistributedSampler call
* Removed extra prints
* Making predictions and labels a multiple of batch size
* Updated number of microbatches
* Made start_remainder similar to DistributedSamplerWithLoop
* Minor spacing update
* Squashed redundant commits
* Test and styling
* Rename test

Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>
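The core trick is padding the index list so every replica receives the same number of full batches (a simplified sketch of the SequentialDistributedSampler logic; the helper name is illustrative):

```python
import math

def shard_indices(dataset_len, batch_size, num_replicas, rank):
    # Round up so the total divides evenly into full per-replica batches.
    per_replica = math.ceil(dataset_len / (batch_size * num_replicas)) * batch_size
    total_size = per_replica * num_replicas

    indices = list(range(dataset_len))
    # Pad by wrapping around; the duplicated predictions are truncated
    # again after results are gathered from all replicas.
    indices += indices[: total_size - len(indices)]
    return indices[rank * per_replica : (rank + 1) * per_replica]
```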
Stas Bekman authored
Stas Bekman authored
* deepspeed checkpoint loading code plus tests
* style
* style
- 16 Mar, 2021 1 commit
Cheng Li authored
* pass hf optimizer and scheduler to deepspeed if not specified in ds config
* update
* make init_deepspeed support config dict
* fix docstring formatting
* clean up trainer's comments
* add new tests
* fix type
* composite argparse doesn't work
* style
* add a new test, rename others
* document new functionality
* complete tests, add docs
* style
* correct level
* Apply suggestions from code review
* add new methods to the doc
* must tell DS we are using a non-native optimizer
* add protection against cpu_offload + HF optimizer combo
* fix the cli overrides
* sync docs + tests
* restore AdamW
* better docs
* need new version
* no longer needed
* remove outdated information
* refactor duplicated code

Co-authored-by: Stas Bekman <stas@stason.org>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
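The gist of the change, as a hedged sketch (the helper and the ds_config handling are illustrative, not the exact implementation; depending on your DeepSpeed version the kwarg may be config or config_params):

```python
import deepspeed

def build_engine(model, ds_config, hf_optimizer, hf_scheduler):
    # Only hand our own optimizer/scheduler to DeepSpeed when the config
    # does not define its own; the ds_config entries win otherwise.
    optimizer = None if "optimizer" in ds_config else hf_optimizer
    lr_scheduler = None if "scheduler" in ds_config else hf_scheduler

    engine, optimizer, _, lr_scheduler = deepspeed.initialize(
        model=model,
        # DeepSpeed builds the optimizer from config when none is passed.
        model_parameters=model.parameters() if optimizer is None else None,
        optimizer=optimizer,
        lr_scheduler=lr_scheduler,
        config=ds_config,
    )
    return engine, optimizer, lr_scheduler
```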