1. 24 May, 2023 3 commits
    • Fix sagemaker DP/MP (#23681) · 75bbf20b
      Zachary Mueller authored
      * Check for use_sagemaker_dp
      
      * Add a check for is_sagemaker_mp when setting _n_gpu again. Should be last broken thing
      
      * Try explicit check?
      
      * Quality
      75bbf20b
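      The fix above concerns how the Trainer picks the number of GPUs when running on SageMaker. Below is a minimal sketch of the kind of guard described, not the actual diff: `is_sagemaker_dp_enabled` and `is_sagemaker_mp_enabled` are the helpers transformers.utils exposes, while `pick_n_gpu` is a hypothetical stand-in for the Trainer's internal `_n_gpu` setup.

        # Illustrative sketch only, not the actual change in #23681.
        import torch
        from transformers.utils import is_sagemaker_dp_enabled, is_sagemaker_mp_enabled

        def pick_n_gpu() -> int:
            # Under SageMaker model parallelism each process drives a single
            # model partition, so the GPU count must not be used as-is.
            if is_sagemaker_mp_enabled():
                return 1
            # SageMaker data parallelism also runs one device per process.
            if is_sagemaker_dp_enabled():
                return 1
            return torch.cuda.device_count()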
    • Paged Optimizer + Lion Optimizer for Trainer (#23217) · 796162c5
      Tim Dettmers authored
      
      
      * Added lion and paged optimizers and made original tests pass.
      
      * Added tests for paged and lion optimizers.
      
      * Added and fixed optimizer tests.
      
      * Style and quality checks.
      
      ---------
      Co-authored-by: younesbelkada <younesbelkada@gmail.com>
      796162c5
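      The commit above wires the bitsandbytes paged and Lion optimizers into the Trainer. A minimal usage sketch follows, assuming the optimizer string names registered by this PR follow the `paged_adamw_32bit` / `lion_32bit` pattern and that bitsandbytes is installed.

        from transformers import TrainingArguments

        # Selecting one of the new optimizers is just a string on TrainingArguments.
        args = TrainingArguments(
            output_dir="out",                # illustrative output directory
            optim="paged_adamw_32bit",       # or e.g. "lion_32bit", "paged_lion_8bit"
            per_device_train_batch_size=8,
        )
        # Trainer(model=model, args=args, train_dataset=...).train() then uses the
        # paged optimizer, which pages optimizer state to CPU memory when GPU memory runs short.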
    • 4-bit QLoRA via bitsandbytes (4-bit base model + LoRA) (#23479) · 9d73b922
      Tim Dettmers authored
      
      
      * Added lion and paged optimizers and made original tests pass.
      
      * Added tests for paged and lion optimizers.
      
      * Added and fixed optimizer tests.
      
      * Style and quality checks.
      
      * Initial draft. Some tests fail.
      
      * Fixed dtype bug.
      
      * Fixed bug caused by torch_dtype='auto'.
      
      * All tests green for 8-bit and 4-bit layers.
      
      * Added fix for fp32 layer norms and bf16 compute in LLaMA.
      
      * Fixing issues for PR #23479.
      
      * Reverted variable name change.
      
      * Added missing tests.
      
      * Fixup changes.
      
      * Added fixup changes.
      
      * Missed some variables to rename.
      
      * revert trainer tests
      
      * revert test trainer
      
      * another revert
      
      * fix tests and safety checkers
      
      * protect import
      
      * simplify a bit
      
      * Update src/transformers/trainer.py
      
      * few fixes
      
      * add warning
      
      * replace with `load_in_kbit = load_in_4bit or load_in_8bit`
      
      * fix test
      
      * fix tests
      
      * this time fix tests
      
      * safety checker
      
      * add docs
      
      * revert torch_dtype
      
      * Apply suggestions from code review
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
      
      * multiple fixes
      
      * update docs
      
      * version checks and multiple fixes
      
      * replace `is_loaded_in_kbit`
      
      * replace `load_in_kbit`
      
      * change methods names
      
      * better checks
      
      * oops
      
      * oops
      
      * address final comments
      
      ---------
      Co-authored-by: younesbelkada <younesbelkada@gmail.com>
      Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
      9d73b922
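      The commit above adds the 4-bit loading path that QLoRA builds on. A minimal sketch of the workflow it enables is shown below, assuming the `bnb_4bit_*` fields exposed by the 4-bit integration: load the base model in 4-bit via `BitsAndBytesConfig`, then attach LoRA adapters with peft. The checkpoint name and LoRA hyperparameters are illustrative, not taken from the PR.

        import torch
        from transformers import AutoModelForCausalLM, BitsAndBytesConfig
        from peft import LoraConfig, get_peft_model

        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",               # NF4 quantization for the base weights
            bnb_4bit_use_double_quant=True,          # quantize the quantization constants too
            bnb_4bit_compute_dtype=torch.bfloat16,   # bf16 compute; layer norms stay in fp32
        )

        model = AutoModelForCausalLM.from_pretrained(
            "huggyllama/llama-7b",                   # illustrative checkpoint
            quantization_config=bnb_config,
            device_map="auto",
        )

        lora_config = LoraConfig(
            r=16, lora_alpha=32, lora_dropout=0.05,  # illustrative LoRA settings
            target_modules=["q_proj", "v_proj"],
            task_type="CAUSAL_LM",
        )
        model = get_peft_model(model, lora_config)   # only the LoRA weights are trainable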
  2. 23 May, 2023 1 commit
  3. 17 May, 2023 1 commit
  4. 16 May, 2023 1 commit
  5. 09 May, 2023 1 commit
  6. 04 May, 2023 1 commit
  7. 02 May, 2023 1 commit
  8. 28 Apr, 2023 2 commits
  9. 21 Apr, 2023 1 commit
  10. 19 Apr, 2023 1 commit
  11. 17 Apr, 2023 1 commit
  12. 07 Apr, 2023 1 commit
  13. 06 Apr, 2023 2 commits
  14. 05 Apr, 2023 1 commit
    • Add thousands separator in training summary (#22583) · 4861c258
      Quentin Meeus authored
      The logger prints a summary at the beginning of training that displays some info such as the number of examples, number of parameters, total number of steps, etc. Those numbers can be quite large and difficult to read. I added a thousands separator to improve readability for the following:
      - num_examples
      - num_train_epochs
      - per_device_train_batch_size
      - total_train_batch_size
      - max_steps
      - num_trainable_params
      4861c258
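      For reference, the thousands separator in these log lines is just Python's `,` format specifier; a minimal sketch of the formatting described above (the values are made up):

        num_examples = 1_250_000
        total_train_batch_size = 2_048
        print(f"  Num examples = {num_examples:,}")                      # Num examples = 1,250,000
        print(f"  Total train batch size = {total_train_batch_size:,}")  # Total train batch size = 2,048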
  15. 04 Apr, 2023 1 commit
  16. 03 Apr, 2023 3 commits
  17. 29 Mar, 2023 1 commit
  18. 23 Mar, 2023 2 commits
  19. 22 Mar, 2023 1 commit
  20. 21 Mar, 2023 1 commit
  21. 20 Mar, 2023 2 commits
  22. 17 Mar, 2023 1 commit
  23. 14 Mar, 2023 3 commits
  24. 13 Mar, 2023 3 commits
  25. 09 Mar, 2023 1 commit
    • Return analysis for hyperparameter_search with Ray backend (#22040) · 04bfac83
      anruijian authored
      * return analysis for hyperparameter_search with ray backend
      
      * Revert "return analysis for hyperparameter_search with ray backend"
      
      This reverts commit cd5179070930e03020d96d98eb51dec3eb21ef75.
      
      * add run_summary attribute to BestRun and return analysis for ray backend
      
      * fix typo
      
      * add doc for run_summary for ray backend
      04bfac83
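      The commit above attaches the Ray Tune analysis to the returned BestRun as run_summary. A minimal usage sketch, assuming a `trainer` built with `model_init=` and Ray Tune installed; the search space and trial count below are illustrative.

        from ray import tune

        def ray_hp_space(trial):
            # Illustrative search space for the Ray backend.
            return {"learning_rate": tune.loguniform(1e-5, 1e-3)}

        best_run = trainer.hyperparameter_search(
            hp_space=ray_hp_space,
            backend="ray",
            n_trials=4,
            direction="minimize",
        )
        print(best_run.run_id, best_run.objective, best_run.hyperparameters)
        analysis = best_run.run_summary   # Ray analysis object returned by this PR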
  26. 08 Mar, 2023 1 commit
  27. 06 Mar, 2023 1 commit
  28. 02 Mar, 2023 1 commit