"examples/legacy/vscode:/vscode.git/clone" did not exist on "61e191987d8aa0778e0f44613deaf7ad99253cab"
- 07 Jun, 2023 3 commits
-
-
Sylvain Gugger authored
* Do not prepare lr scheduler as it has the right number of steps * Trigger CI * Trigger CI * Trigger CI * Add fake comment * Remove fake comment * Trigger CI please!
-
Sourab Mangrulkar authored
* fix executable batch size issue * fix * undo
-
Younes Belkada authored
* support PEFT models when saving the model using trainer * fixup
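For context, a minimal sketch of the setup this fix targets; the base checkpoint and LoRA settings are illustrative, not taken from the commit.

```python
# A PEFT-wrapped model handed to the Trainer; after this fix,
# trainer.save_model() also saves the adapter weights for such models.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")  # placeholder checkpoint
peft_model = get_peft_model(base, LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"))

# The wrapped model would then be passed as Trainer(model=peft_model, ...),
# and trainer.save_model(output_dir) stores the PEFT adapter.
```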
-
- 05 Jun, 2023 1 commit
-
-
Sourab Mangrulkar authored
* fix trainer slow tests * commit 2
-
- 02 Jun, 2023 1 commit
-
-
Claudius Kienle authored
Trainer: fixed KeyError on evaluate for ReduceLROnPlateau
Co-authored-by: Claudius Kienle <claudius.kienle@artiminds.com>
-
- 31 May, 2023 8 commits
-
-
Sourab Mangrulkar authored
remove the extra `accelerator.prepare` that slipped in with multiple updates from main 😅
-
Sylvain Gugger authored
-
Sourab Mangrulkar authored
* mixed precision support via accelerate * fix issues * fix for the sharded ddp case * fix flax and tf failing tests * refactor the place to create `Accelerator` object * move ddp prep to accelerate * fix 😅 * resolving comments * move fsdp handling to accelerate * fixes * fix saving * shift torch dynamo handling to accelerate * shift deepspeed integration and save & load utils to accelerate * fix accelerate launcher support * oops * fix 🐛 * save ckpt fix * Trigger CI * nasty 🐛 😅 * as deepspeed needs grad_acc fixes, transfer grad_acc to accelerate * make tests happy * quality ✨ * loss tracked needs to account for grad_acc * fixing the deepspeed tests * quality ✨ * 😅😅😅 * tests 😡 * quality ✨ * Trigger CI * resolve comments and fix the issue with the previous merge from branch * Trigger CI * accelerate took over deepspeed integration
Co-authored-by: Stas Bekman <stas@stason.org>
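For readers following this series of accelerate-integration commits, a minimal sketch of the standalone accelerate pattern the Trainer moved to; it illustrates the library's public API, not the Trainer's internal code.

```python
from accelerate import Accelerator

# One Accelerator object owns mixed precision, device placement,
# DDP/FSDP preparation and gradient accumulation.
accelerator = Accelerator(mixed_precision="bf16", gradient_accumulation_steps=4)

# In a real training loop (model/optimizer/dataloader defined elsewhere):
# model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)
# ...
# accelerator.backward(loss)  # replaces loss.backward()
```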
-
Sylvain Gugger authored
-
Sourab Mangrulkar authored
* mixed precision support via accelerate * fix issues * fix for the sharded ddp case * fix flax and tf failing tests * refactor the place to create `Accelerator` object * move ddp prep to accelerate * fix 😅 * resolving comments * move fsdp handling to accelerate * fixes * fix saving * shift torch dynamo handling to accelerate
-
Sourab Mangrulkar authored
* mixed precision support via accelerate * fix issues * fix for the sharded ddp case * fix flax and tf failing tests * refactor the place to create `Accelerator` object * move ddp prep to accelerate * fix 😅 * resolving comments * move fsdp handling to accelerate * fixes * fix saving
-
Sourab Mangrulkar authored
* mixed precision support via accelerate * fix issues * fix for the sharded ddp case * fix flax and tf failing tests * refactor the place to create `Accelerator` object * move ddp prep to accelerate * fix 😅 * resolving comments
-
Sourab Mangrulkar authored
* mixed precision support via accelerate * fix issues * fix for the sharded ddp case * fix flax and tf failing tests * refactor the place to create `Accelerator` object * address comments by removing debugging print statements
-
- 26 May, 2023 1 commit
-
-
Zachary Mueller authored
Log the right train_batch_size if using auto_find_batch_size and also log the adjusted value separately. (#23800) * Log right bs * Log * Diff message
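As a usage note, a minimal sketch of the option this logging fix concerns, assuming accelerate is installed; the values are illustrative.

```python
from transformers import TrainingArguments

# With auto_find_batch_size=True the Trainer retries with a smaller
# per-device batch size on CUDA OOM, so the effective value can differ
# from the one configured here, which is why the adjusted value is logged.
args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=64,  # starting point; may be reduced at runtime
    auto_find_batch_size=True,
)
```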
-
- 25 May, 2023 1 commit
-
-
Sylvain Gugger authored
-
- 24 May, 2023 3 commits
-
-
Zachary Mueller authored
* Check for use_sagemaker_dp * Add a check for is_sagemaker_mp when setting _n_gpu again. Should be last broken thing * Try explicit check? * Quality
-
Tim Dettmers authored
* Added lion and paged optimizers and made original tests pass. * Added tests for paged and lion optimizers. * Added and fixed optimizer tests. * Style and quality checks.
Co-authored-by: younesbelkada <younesbelkada@gmail.com>
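A minimal sketch of how these optimizers are selected through `TrainingArguments`, assuming a transformers/bitsandbytes combination that includes this change; the optim string shown is one of the names added around this PR.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    optim="paged_adamw_8bit",  # other additions include "paged_adamw_32bit",
                               # "lion_8bit", "lion_32bit", "paged_lion_8bit",
                               # "paged_lion_32bit" (requires bitsandbytes)
)
```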
-
Tim Dettmers authored
* Added lion and paged optimizers and made original tests pass. * Added tests for paged and lion optimizers. * Added and fixed optimizer tests. * Style and quality checks. * Initial draft. Some tests fail. * Fixed dtype bug. * Fixed bug caused by torch_dtype='auto'. * All tests green for 8-bit and 4-bit layers. * Added fix for fp32 layer norms and bf16 compute in LLaMA. * Initial draft. Some tests fail. * Fixed dtype bug. * Fixed bug caused by torch_dtype='auto'. * All tests green for 8-bit and 4-bit layers. * Added lion and paged optimizers and made original tests pass. * Added tests for paged and lion optimizers. * Added and fixed optimizer tests. * Style and quality checks. * Fixing issues for PR #23479. * Added fix for fp32 layer norms and bf16 compute in LLaMA. * Reverted variable name change. * Initial draft. Some tests fail. * Fixed dtype bug. * Fixed bug caused by torch_dtype='auto'. * All tests green for 8-bit and 4-bit layers. * Added lion and paged optimizers and made original tests pass. * Added tests for paged and lion optimizers. * Added and fixed optimizer tests. * Style and quality checks. * Added missing tests. * Fixup changes. * Added fixup changes. * Missed some variables to rename. * revert trainer tests * revert test trainer * another revert * fix tests and safety checkers * protect import * simplify a bit * Update src/transformers/trainer.py * few fixes * add warning * replace with `load_in_kbit = load_in_4bit or load_in_8bit` * fix test * fix tests * this time fix tests * safety checker * add docs * revert torch_dtype * Apply suggestions from code review * multiple fixes * update docs * version checks and multiple fixes * replace `is_loaded_in_kbit` * replace `load_in_kbit` * change methods names * better checks * oops * oops * address final comments
Co-authored-by: younesbelkada <younesbelkada@gmail.com>
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
-
- 23 May, 2023 1 commit
-
-
小桐桐 authored
Ref: https://github.com/huggingface/peft/issues/394
Loading a quantized checkpoint into a non-quantized Linear8bitLt is not supported; call module.cuda() before module.load_state_dict().
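A minimal sketch of the ordering described above, assuming bitsandbytes is installed and a CUDA device is available; the layer size and checkpoint path are illustrative.

```python
import torch
import bitsandbytes as bnb

module = bnb.nn.Linear8bitLt(1024, 1024, has_fp16_weights=False)

# Move the module to the GPU first (this is where quantization happens) ...
module.cuda()

# ... and only then load the quantized checkpoint, per the note above.
state_dict = torch.load("int8_linear_checkpoint.pt")  # hypothetical file
module.load_state_dict(state_dict)
```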
-
- 17 May, 2023 1 commit
-
-
Hugo Abonizio authored
-
- 16 May, 2023 1 commit
-
-
ropoctl authored
Logging an error and continuing is probably following the principle of least surprise.
-
- 09 May, 2023 1 commit
-
-
Konstantin Dobler authored
* Ratio option for `logging_steps`, `eval_steps`, `save_steps` * Add guards if arguments are not set * Add more detailed comments + formatting * Update src/transformers/training_args.py * Update src/transformers/training_args.py * Update src/transformers/training_args.py * Convert args values to `int` if bigger than 1 * `black` * `make fixup`
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
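A minimal sketch of the resulting ratio behaviour, assuming a transformers release that includes this change: values strictly between 0 and 1 are interpreted as a fraction of the total training steps (output_dir and the strategies are illustrative).

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    evaluation_strategy="steps",
    save_strategy="steps",
    logging_steps=0.05,  # log every 5% of the total training steps
    eval_steps=0.25,     # evaluate every 25% of the total training steps
    save_steps=0.25,     # checkpoint every 25% of the total training steps
)
```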
-
- 04 May, 2023 1 commit
-
-
Qingyang Wu authored
* fix resume fsdp * fix rank 0 loading * fix style and quality
-
- 02 May, 2023 1 commit
-
-
Wing Lian authored
-
- 28 Apr, 2023 2 commits
-
-
Shivam Shrirao authored
CUDA rng_state_all is used when saving in distributed mode, so the same should also be used when loading (#23045): the CUDA RNG state should cover all devices for distributed training because all of them were saved.
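A minimal sketch of the save/load symmetry this fix enforces; the checkpoint path and dictionary key are illustrative, not the Trainer's actual checkpoint layout.

```python
import torch

# Saving: capture the RNG state of every visible GPU, not just the current one.
checkpoint = {"cuda_rng_state_all": torch.cuda.get_rng_state_all()}
torch.save(checkpoint, "rng_state.pth")

# Loading: restore all per-device states with the matching *_all call.
checkpoint = torch.load("rng_state.pth")
torch.cuda.set_rng_state_all(checkpoint["cuda_rng_state_all"])
```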
-
Maxime Méloux authored
* Add Trainer support for ReduceLROnPlateau (fixes #16503) * Remove training argument and add default instance
Co-authored-by: mmeloux <maxime.meloux@loria.fr>
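A minimal sketch of how this scheduler is requested, assuming a transformers release that includes this change; since ReduceLROnPlateau steps on an evaluation metric, an evaluation strategy has to be set (output_dir and the strategy are illustrative).

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    evaluation_strategy="epoch",               # the scheduler needs eval metrics
    lr_scheduler_type="reduce_lr_on_plateau",  # wraps torch.optim.lr_scheduler.ReduceLROnPlateau
)
```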
-
- 21 Apr, 2023 1 commit
-
-
Wing Lian authored
ddp fixes for stable lm training
-
- 19 Apr, 2023 1 commit
-
-
Liu Chenyang authored
* move preprocess_logits_for_metrics before _nested_gather in trainer.evaluation_loop * fix * Update src/transformers/trainer.py * fix * fix
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
-
- 17 Apr, 2023 1 commit
-
-
Zachary Mueller authored
* Use accelerate for device management * Add accelerate to setup
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
-
- 07 Apr, 2023 1 commit
-
-
Seung-Moo Yang authored
-
- 06 Apr, 2023 2 commits
-
-
Sourab Mangrulkar authored
fix fsdp
-
Younes Belkada authored
add safety checker
-
- 05 Apr, 2023 1 commit
-
-
Quentin Meeus authored
The logger prints a summary at the beginning of training that displays some info such as the number of examples, number of parameters, total number of steps, etc. Those numbers can be quite large and difficult to read. I added a thousands separator to improve readability for the following: - num_examples - num_train_epochs - per_device_train_batch_size - total_train_batch_size - max_steps - num_trainable_params
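The change boils down to Python's thousands-separator format specifier; a minimal sketch with illustrative values:

```python
num_examples = 1_281_167
total_train_batch_size = 2_048
max_steps = 125_000

# f"{value:,}" inserts thousands separators in the training summary lines.
print(f"  Num examples = {num_examples:,}")  # Num examples = 1,281,167
print(f"  Total train batch size = {total_train_batch_size:,}")
print(f"  Total optimization steps = {max_steps:,}")
```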
-
- 04 Apr, 2023 1 commit
-
-
Viktor Scherbakov authored
* implemented safetensors save/load * remove duplicated file * added tests * more tests * style fix * fix tf tests * change to list comprehension * review fixes + safe load for sharded checkpoint * style fix * remove rogue import * remove partial to avoid undefined exception * use naming alias instead of safetensors.torch * fix safe sharding in tests * grammar * update docs * update docs * minor corrections * style
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
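A minimal sketch of opting into safetensors serialization from the Trainer side, assuming a transformers release that includes this change (output_dir is illustrative):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    save_safetensors=True,  # checkpoints are written as model.safetensors
)
```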
-
- 03 Apr, 2023 3 commits
-
-
Xuehai Pan authored
* [setup] drop deprecated `distutils` usage * drop deprecated `distutils.util.strtobool` usage * fix import order * reformat docstring by `doc-builder`
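A minimal sketch of the kind of replacement such a cleanup implies; this helper is illustrative, not the exact code added to transformers.

```python
def strtobool(val: str) -> bool:
    """Local stand-in for the deprecated distutils.util.strtobool."""
    val = val.lower()
    if val in {"y", "yes", "t", "true", "on", "1"}:
        return True
    if val in {"n", "no", "f", "false", "off", "0"}:
        return False
    raise ValueError(f"invalid truth value {val!r}")
```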
-
Ilya authored
-
Younes Belkada authored
[`Trainer`] Force `is_model_parallel` when the model is loaded across multiple GPUs using `accelerate` (#22532) * add `is_model_parallel` arg on Trainer * add warning * adapt from suggestions * revert t5 changes * remove commas * adapt from suggestions
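A minimal sketch of the situation this change detects: a model dispatched across several GPUs by accelerate via `device_map="auto"` (the checkpoint name is a placeholder).

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-560m",  # placeholder checkpoint
    device_map="auto",        # accelerate shards the modules across available GPUs
)
print(getattr(model, "hf_device_map", None))  # per-module device placement
```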
-
- 29 Mar, 2023 1 commit
-
-
jeffhataws authored
This reverts commit fd81746dbec5f17c8285a0fdc72ca4b4c025cc33.
-
- 23 Mar, 2023 2 commits
-
-
jeffhataws authored
This PR fixes the "RuntimeError: No CUDA GPUs are available" error when running with the --bf16 option on Neuron. Related PRs: https://github.com/huggingface/transformers/pull/20684 https://github.com/huggingface/transformers/pull/22300
-
Quentin Lhoest authored
* Mention why one needs to specify max_steps in Trainer * dummy change to trigger CI
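A minimal sketch of the case that documentation note covers: with a streaming/iterable train dataset the Trainer cannot infer an epoch length, so max_steps has to be set explicitly (output_dir and the step count are illustrative).

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    max_steps=10_000,  # required when the train dataset has no __len__
)
```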
-