Commits · eed9ed679878ada2f6d2eefccdbda368cabc88b1 · chenpangpang / transformers

14 Jun, 2024 1 commit

xpu: support xpu backend from stock pytorch (>=2.4) (#31238) · eed9ed67

Dmitry Rogozhkin authored Jun 14, 2024

* xpu: support xpu backend from stock pytorch (>=2.4)

Fixes: https://github.com/huggingface/transformers/issues/31237

XPU backend is available in stock PyTorch starting from
version 2.4, see [1]. This commit extends Hugging Face Transformers
to support XPU from both IPEX and stock PyTorch. IPEX is
tried first.

[1] https://github.com/pytorch/pytorch/issues/114842
Requires: https://github.com/huggingface/accelerate/pull/2825

Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>

* xpu: enable gpt2 and decision_transformer tests for xpu pytorch backend

Note that running the xpu tests requires setting the environment variable
TRANSFORMERS_TEST_DEVICE_SPEC=spec.py for the test runner, where spec.py contains:

  import torch
  DEVICE_NAME = 'xpu'
  # Backend-specific callables the test suite uses for this device
  MANUAL_SEED_FN = torch.xpu.manual_seed
  EMPTY_CACHE_FN = torch.xpu.empty_cache
  DEVICE_COUNT_FN = torch.xpu.device_count

Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>

---------
Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>


07 Jun, 2024 1 commit

Implement JSON dump conversion for torch_dtype in TrainingArguments (#31224) · 60861fe1

조준래 authored Jun 07, 2024



* Implement JSON dump conversion for torch_dtype in TrainingArguments

* Add unit test for converting torch_dtype in TrainingArguments to JSON

* move unit test for converting torch_dtype into TrainerIntegrationTest class

* reformatting using ruff

* convert dict_torch_dtype_to_str to private method _dict_torch_dtype_to_str

---------
Co-authored-by: jun.4 <jun.4@kakaobrain.com>


03 Jun, 2024 2 commits
- Set greater_is_better to False if metric_for_best_model ends with "loss" (#31142) · df5abae8
  miivanov90 authored Jun 03, 2024
```
* update to not(endswith(loss))

* ruff formatting
```
- Rename sanity_evaluation to eval_on_start (#31192) · c6c78733
  Qubitium authored Jun 03, 2024
```
* Rename sanity_evaluation to eval_on_start

* move arg back to last
```
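
The `greater_is_better` default from #31142 above can be sketched as follows. This is a minimal illustration, not the code from the PR: the helper name `resolve_greater_is_better` is hypothetical, and it assumes an explicit user setting always takes precedence.

```python
from typing import Optional


def resolve_greater_is_better(
    metric_for_best_model: Optional[str],
    greater_is_better: Optional[bool] = None,
) -> Optional[bool]:
    """Hypothetical helper: choose a default for greater_is_better."""
    # Respect an explicit user choice, and do nothing when no metric is set.
    if greater_is_better is not None or metric_for_best_model is None:
        return greater_is_better
    # A metric whose name ends with "loss" (e.g. "eval_loss") should be
    # minimized, so greater is NOT better; every other metric is maximized.
    return not metric_for_best_model.endswith("loss")
```

With this rule, `"eval_loss"` resolves to `False` and `"accuracy"` resolves to `True`.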
31 May, 2024 1 commit

[trainer] add sanity evaluation option (#31146) · f8e6ba45

Marc Sun authored May 31, 2024



* add sanity evaluation

* fix

* Apply suggestions from code review
Co-authored-by: Zach Mueller <muellerzr@gmail.com>

* fix

---------
Co-authored-by: Zach Mueller <muellerzr@gmail.com>


28 May, 2024 1 commit

Remove redundant backend checks in training_args.py (#30999) · 537deb78

Hengwen Tong authored May 28, 2024



* Remove backend checks in training_args.py

* Explicitly initialize the device

---------
Co-authored-by: tonghengwen <tonghengwen@cambricon.com>


23 May, 2024 1 commit

Add a check that warmup_steps is either 0 or >= 1 (#30764) · 892b13d3

Yasmin Moslem authored May 23, 2024



* Add a check that warmup_steps is either 0 or >= 1

Update training_args.py to add a check that warmup_steps is either 0 or >= 1. Otherwise, raise an error.

* Update src/transformers/training_args.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

---------
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
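
The check described above might look like the sketch below. The function name `validate_warmup_steps` and the error text are hypothetical and not taken from the PR; the only rule assumed is the one stated in the commit message (0 or >= 1 is valid, anything else raises).

```python
def validate_warmup_steps(warmup_steps):
    """Hypothetical validator: accept 0 or any value >= 1, reject the rest."""
    if warmup_steps != 0 and warmup_steps < 1:
        # A fractional value such as 0.1 is most likely a ratio passed to the
        # wrong argument, so fail loudly instead of silently truncating it.
        raise ValueError(
            f"warmup_steps must be either 0 or >= 1, got {warmup_steps}"
        )
    return warmup_steps
```

Negative values are rejected by the same condition, since they are neither 0 nor >= 1.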


21 May, 2024 2 commits

Enforce saving at end of training if saving option chosen (#30160) · daf281f4

Zach Mueller authored May 21, 2024

* Enforce saving at end of training

* Fix test

* Rework test

* Fixup tests

* Update comment based on sourab feedback

* Clean


FEAT / Trainer: LOMO optimizer support (#30178) · 8871b261

Younes Belkada authored May 21, 2024



* add V1 - adalomo not working yet

* add todo docs + refactor from comments

* adjust LR

* add docs

* add more elaborated test

* Apply suggestions from code review
Co-authored-by: Zach Mueller <muellerzr@gmail.com>

* fix

* push

* add accelerate check

* fix DDP case

* Apply suggestions from code review
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* fix

* init kwargs

* safely add attribute

* revert to enum logic

* Update src/transformers/trainer.py

---------
Co-authored-by: Zach Mueller <muellerzr@gmail.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>


20 May, 2024 1 commit

Introduce configured_state arg for accelerator_config (#29781) · 92d1d97c

Zach Mueller authored May 20, 2024



* Introduce configured_state

* Include note on tuning

* Allow for users to have defined a state already

* Include tests

* Add note on hparam tune

* Guard a bit better

* Update src/transformers/training_args.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update src/transformers/training_args.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Finish rebase

* Finish rebase

* Guard carefully

* Fixup test

* Refactor

* Fin refactor

* Comment

* Update wrt feedback

---------
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>


13 May, 2024 1 commit

CI: update to ROCm 6.0.2 and test MI300 (#30266) · 37bba2a3

fxmarty authored May 13, 2024



* update to ROCm 6.0.2 and test MI300

* add callers for mi300

* update dockerfile

* fix trainer tests

* remove apex

* style

* Update tests/trainer/test_trainer_seq2seq.py

* Update tests/trainer/test_trainer_seq2seq.py

* Update tests/trainer/test_trainer_seq2seq.py

* Update tests/trainer/test_trainer_seq2seq.py

* update to torch 2.3

* add workflow dispatch target

* we may need branches: mi300-ci after all

* nit

* fix docker build

* nit

* add check runner

* remove docker-gpu

* fix issues

* fix

---------
Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com>
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>


06 May, 2024 1 commit

Trainer - add cache clearing and the option for batched eval metrics computation (#28769) · df475bf8

Nate Cibik authored May 06, 2024

* Added cache clearing for GPU efficiency.

* Added cache clearing for GPU efficiency.

* Added batch_eval_metrics capability

* Ran make fixup

* Fixed bug

* Fixed whitespace issue

* Fixed outdated condition

* Updated docstrings with instructions for batch_eval_metrics. Updated end of dataloader logic

* Added first version of batch_eval_metrics Trainer test

* Fixed batch_eval_metrics Trainer tests for both eval and predict

* Fixed batch_eval_metrics behavior for new Trainer variables

* Fixed batch_eval_metrics Trainer tests

* Ran fixup


03 May, 2024 1 commit
- Fix W&B run name (#30462) · 66f675eb
  Pavel Iakubovskii authored May 03, 2024
```
* Remove comparison to output_dir

* Update docs for `run_name`

* Add warning
```
02 May, 2024 1 commit
- Fix for Neuron (#30259) · fbabd674
  Michael Benayoun authored May 02, 2024
  
29 Apr, 2024 1 commit
- Allow boolean FSDP options in fsdp_config (#30439) · 80126f98
  Howard Liberty authored Apr 29, 2024
```
* Allow boolean FSDP options in fsdp_config

* Use lower() to be safe
```
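
One way to accept both booleans and strings for an FSDP option, in the spirit of the "use lower() to be safe" bullet above, is sketched here. The helper name `fsdp_option_to_bool` is hypothetical, and the PR itself may normalize values differently (for example, to a lowercase string rather than a bool).

```python
def fsdp_option_to_bool(value) -> bool:
    """Hypothetical normalizer for a boolean-like fsdp_config option."""
    # Real booleans pass through unchanged.
    if isinstance(value, bool):
        return value
    # Strings such as "True", "true" or "TRUE" are compared case-insensitively.
    return str(value).lower() == "true"


# Example config mixing both spellings of a boolean option.
fsdp_config = {"sync_module_states": "True", "use_orig_params": False}
normalized = {key: fsdp_option_to_bool(val) for key, val in fsdp_config.items()}
```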
25 Apr, 2024 1 commit

Introduce Stateful Callbacks (#29666) · ad697f18

Zach Mueller authored Apr 25, 2024



* Introduce saveable callbacks

* Add note

* Test for non-present and flag

* Support early stopping and refusing to train further

* Update docstring

* More saving

* Import oopsie

* Apply suggestions from code review
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Make it go through TrainingArguments

* Document

* Fix test

* Apply suggestions from code review
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Rework to allow for duplicates

* Clean

* Fix failing tests

---------
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>


24 Apr, 2024 1 commit

Enable fp16 on CPU (#30459) · 5c57463b

Zach Mueller authored Apr 24, 2024

* Check removing flag for torch

* LLM oops

* Getting there...

* More discoveries

* Change

* Clean up and prettify

* Logic check

* Not


22 Apr, 2024 1 commit

Add FSDP config for CPU RAM efficient loading through accelerate (#30002) · f16caf44

Howard Liberty authored Apr 22, 2024



* Add FSDP config for CPU RAM efficient loading

* Style fix

* Update src/transformers/training_args.py
Co-authored-by: Zach Mueller <muellerzr@gmail.com>

* Update src/transformers/training_args.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Add sync_module_states and cpu_ram_efficient_loading validation logic

* Update src/transformers/training_args.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Style

---------
Co-authored-by: Zach Mueller <muellerzr@gmail.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>


18 Apr, 2024 1 commit
- 🚨🚨🚨Deprecate `evaluation_strategy` to `eval_strategy`🚨🚨🚨 (#30190) · 60d5f8f9
  Zach Mueller authored Apr 18, 2024
```
* Alias

* Note alias

* Tests and src

* Rest

* Clean

* Change typing?

* Fix tests

* Deprecation versions
```
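
The aliasing pattern behind this deprecation can be sketched as below. The class `ArgsSketch` is a hypothetical stand-in, not `TrainingArguments` itself, and the warning text is an assumption; only the two argument names come from the commit title.

```python
import warnings
from dataclasses import dataclass
from typing import Optional


@dataclass
class ArgsSketch:
    """Hypothetical stand-in showing a deprecated alias for a renamed field."""

    eval_strategy: str = "no"
    evaluation_strategy: Optional[str] = None  # deprecated alias

    def __post_init__(self):
        # If the old name was used, warn and copy its value to the new name.
        if self.evaluation_strategy is not None:
            warnings.warn(
                "`evaluation_strategy` is deprecated; use `eval_strategy` instead.",
                FutureWarning,
            )
            self.eval_strategy = self.evaluation_strategy
```

Code that still passes `evaluation_strategy="steps"` keeps working under this pattern, but sees a `FutureWarning` pointing at the new name.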
17 Apr, 2024 1 commit

Add strategy to store results in evaluation loop (#30267) · c15aad09

Pavel Iakubovskii authored Apr 17, 2024

* Add evaluation loop container for interm. results

* Add tests for EvalLoopContainer

* Formatting

* Fix padding_index in test and typo

* Move EvalLoopContainer to pr_utils to avoid additional imports

* Fix `eval_do_concat_batches` arg description

* Fix EvalLoopContainer import


16 Apr, 2024 2 commits

Raise relevant error when wrong type is passed in as the accelerator_config (#29997) · e27d9308
Zach Mueller authored Apr 16, 2024
```
* Raise relevant error

* Use type instead
```

Allow for str versions of dicts based on typing (#30227) · 487505ff

Zach Mueller authored Apr 16, 2024

* Bookmark, initial implementation. Need to test

* Clean

* Working fully, woop woop

* I think working version now, testing

* Fin!

* rm cast, could keep None

* Fix typing issue

* rm typehint

* Add test

* Add tests and make more rigid
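
Accepting a string form of a dict-typed argument (for example, one passed on the command line) can be sketched as below. The helper `parse_dict_arg` is hypothetical, and it assumes the string form is JSON; the PR may detect and parse such values differently.

```python
import json


def parse_dict_arg(value):
    """Hypothetical parser: turn a JSON-object string into a dict."""
    # Only strings that look like a JSON object are parsed; other strings
    # (such as a path to a config file) and real dicts pass through as-is.
    if isinstance(value, str) and value.strip().startswith("{"):
        return json.loads(value)
    return value
```

Leaving non-JSON strings untouched keeps options that also accept a file path working.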


10 Apr, 2024 1 commit

Add str to TrainingArguments report_to type hint (#30078) · b7d002bd

Matthew Hoffman authored Apr 10, 2024

* Add str to TrainingArguments report_to type hint

* Swap order in Union

* Merge Optional into Union

https://github.com/huggingface/transformers/pull/30078#issuecomment-2042227546


03 Apr, 2024 1 commit

Make clearer about zero_init requirements (#29879) · 863e2562

Zach Mueller authored Apr 03, 2024



* Docstring to note about zero init

* Check for accelerate

* Change conditional return

* Tweak

* Add new accelerate-specific zero3 check

* Fix import

* Revert to RTFM

* Update src/transformers/modeling_utils.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

---------
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>


27 Mar, 2024 1 commit

add Cambricon MLUs support (#29627) · 75769744

huismiling authored Mar 27, 2024

* add Cambricon MLUs support

* fix mlu device rng state

* up for quality check

* up mlu to support fp16

* fix mlu device dependency error

* fix mlu device dependency error

* enable mlu device for bf16

* fix mlu device memory tracker


19 Mar, 2024 1 commit

FEAT / Optim: Add GaLore optimizer (#29588) · f6261d7d

Younes Belkada authored Mar 19, 2024



* add galore v1

* add import

* add tests and doc

* fix doctest

* forward contrib credits from discussions

* forward contrib credits from discussions

* Apply suggestions from code review
Co-authored-by: Zach Mueller <muellerzr@gmail.com>

* fix failing tests

* switch to `optim_target_modules` and clarify docs

* more clarification

* enhance lookup logic

* update a test to add peak memory

* add regex, all-linear and single string support

* add layer-wise optimization through DummyOptimizers and LRSchedulers

* forward contrib credits from discussions and original idea

* add a section about DDP not supported in layerwise

* Update src/transformers/trainer.py
Co-authored-by: Zach Mueller <muellerzr@gmail.com>

* fix self

* check only if layer_wise

* Update src/transformers/training_args.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* oops

* make use of intervals

* clarify comment

* add matching tests

* GaLoRe -> GaLore

* move to `get_scheduler`

* add note on docs

* add a warning

* adapt a bit the docs

* update docstring

* support original API

* Update docs/source/en/trainer.md

* slightly refactor

* Update docs/source/en/trainer.md
Co-authored-by: Matthew Douglas <38992547+matthewdouglas@users.noreply.github.com>

* Update src/transformers/training_args.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* fix args parsing and add tests

* remove warning for regex

* fix type hint

* add note about extra args

* make `is_regex` return optional

---------

Co-authored-by: Maxime <maximegmd@users.noreply.github.com>
Co-authored-by: Wing Lian <winglian@users.noreply.github.com>
Co-authored-by: Zach Mueller <muellerzr@gmail.com>
Co-authored-by: hiyouga <hiyouga@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Co-authored-by: Matthew Douglas <38992547+matthewdouglas@users.noreply.github.com>


13 Mar, 2024 1 commit

Add support for FSDP+QLoRA and DeepSpeed ZeRO3+QLoRA (#29587) · 350c5d15

Sourab Mangrulkar authored Mar 13, 2024



* fsdp+qlora related changes

* fixes

* Update quantization_config.py

* support fsdp+qlora and dsz3+qlora

* Update quantization_config.py

* Update modeling_utils.py

* Update modeling_utils.py

* Update modeling_utils.py

* Update modeling_utils.py

* Update modeling_utils.py

* Update modeling_utils.py

* handle fsdp+qlora and dsz3+qlora correctly while model loading

* fix param count

* quality

* fsdp related changes

* fsdp changes only when using LoRA/QLoRA

* add accelerate version check

* refactor, update min accelerate version and add tests

1. Update minimum accelerate version to 0.26.0
2. Clean the trainer wrt accelerate version checks
3. FSDP refactor and test for fsdp config
4. use `itemsize` instead of `dtype2bytes` dict

* fix test

* Address comments
Co-Authored-By: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* fix the conditional flag

* fix conditional flag

* address comments
Co-Authored-By: Zach Mueller <7831895+muellerzr@users.noreply.github.com>

---------
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Co-authored-by: Zach Mueller <7831895+muellerzr@users.noreply.github.com>


11 Mar, 2024 1 commit

Make torch xla available on GPU (#29334) · 873d9bb3

Yitong Huang authored Mar 11, 2024



* add USE_TORCH_XLA env

* rename torch_tpu to torch_xla

* better is_torch_xla_available; fix some fsdp and performance issues

* fix format

* fix bug when pjrt_device is cpu

* fix bug

* fix the deprecation handling

---------
Co-authored-by: anw90 <ang868@gmail.com>
Co-authored-by: wangang.wa <wangang.wa@alibaba-inc.com>


08 Mar, 2024 1 commit
- fix typos in FSDP config parsing logic in `TrainingArguments` (#29189) · 697f05ba
  Yun Dai authored Mar 08, 2024
```
fix FSDP config
```
06 Mar, 2024 1 commit

Fix TrainingArguments regression with torch <2.0.0 for dataloader_prefetch_factor (#29447) · 2890116a

Matthew Hoffman authored Mar 06, 2024

* Fix TrainingArguments regression with torch <2.0.0 for dataloader_prefetch_factor

dataloader_prefetch_factor was added to TrainingArguments in #28498 with the default value None, but versions of torch<2.0.0 do not accept None and will raise an error if num_workers == 0 and prefetch_factor != 2.

* Add is_torch_available() check

* Use is_torch_greater_or_equal_than_2_0

add back check for dataloader_prefetch_factor
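
The constraint described in the first bullet can be honored by choosing the default from the installed torch version. A minimal sketch, assuming a hypothetical helper `default_prefetch_factor` that receives the version as a `(major, minor)` tuple, or `None` when torch is not installed; it is illustrative and not the code from the PR.

```python
def default_prefetch_factor(torch_version):
    """Hypothetical helper: pick a version-compatible default prefetch factor.

    torch_version is a (major, minor) tuple, or None if torch is unavailable.
    """
    # torch >= 2.0 accepts None, and without torch the value is never used.
    if torch_version is None or torch_version >= (2, 0):
        return None
    # Older torch rejects None and expects 2 when num_workers == 0.
    return 2
```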


01 Mar, 2024 1 commit

Fix deprecated arg issue (#29372) · 1a7c117d

Zach Mueller authored Mar 01, 2024

* Fix deprecated arg issue

* Trainer check too

* Check for dict or dataclass

* Simplify, make config always AcceleratorConfig

* Upstream to Trainer


20 Feb, 2024 1 commit

FEAT [`Trainer` / `bnb`]: Add RMSProp from `bitsandbytes` to HF `Trainer` (#29082) · f7ef7cec

Younes Belkada authored Feb 20, 2024



* add RMSProp to Trainer

* revert some change

* Update src/transformers/trainer.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

---------
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>


14 Feb, 2024 2 commits

[TPU] Support PyTorch/XLA FSDP via SPMD (#28949) · 5f06053d

Jiewen Tan authored Feb 14, 2024

* Initial commit

* Add guards for the global mesh

* Address more comments

* Move the dataloader into integrations/tpu.py

* Fix linters

* Make kwarg more explicit

* Remove the move device logic

* Fix the CI

* Fix linters

* Re-enable checkpointing


Introduce AcceleratorConfig dataclass (#28664) · 0507e69d

Zach Mueller authored Feb 14, 2024



* Introduce acceleratorconfig dataclass

* Extra second warn

* Move import

* Try moving import under is_accelerate_available

* Quality

* Apply suggestions from code review
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Clean

* Remove to_kwargs

* Change version

* Improve tests by including dispatch and split batches

* Improve reliability

* Update tests/trainer/test_trainer.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Fixup tests and review nits

* Make tests pass

* protect import

* Protect import

* Empty-Commit

* Make training_args.to_dict handle the AcceleratorConfig

---------
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>


09 Feb, 2024 1 commit
- Fix type annotations on neftune_noise_alpha and fsdp_config TrainingArguments parameters (#28942) · d123e661
  Philip Blair authored Feb 09, 2024
  
07 Feb, 2024 1 commit
- fix: Fixed the documentation for `logging_first_step` by removing "evaluate" (#28884) · 64d1518c
  Sai-Suraj-27 authored Feb 07, 2024
```
Fixed the documentation for logging_first_step by removing evaluate.
```
05 Feb, 2024 1 commit
- [Docs] Fix bad doc: replace save with logging (#28855) · c430d6ea
  Zizhao Chen authored Feb 04, 2024
```
Fix bad doc: replace save with logging
```
23 Jan, 2024 1 commit

add dataloader prefetch factor in training args and trainer (#28498) · 5b5e71dc

Quentin Meeus authored Jan 23, 2024



* add dataloader prefetch factor in training args and trainer

* remove trailing spaces

* prevent dataloader_num_workers == 0 and dataloader_prefetch_factor != None

dataloader_prefetch_factor works only when data is loaded in a different process from the main one. This commit adds the necessary checks to avoid having prefetch_factor set when there is no such process.

* Remove whitespaces in empty line

* Update src/transformers/training_args.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update src/transformers/training_args.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update src/transformers/training_args.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update src/transformers/training_args.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

---------
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
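
The guard described in the third bullet above can be sketched as follows. The function name `check_prefetch_factor` and the error message are hypothetical; the condition itself is the one named in the commit message (no worker processes, yet a prefetch factor is set).

```python
def check_prefetch_factor(num_workers: int, prefetch_factor=None) -> None:
    """Hypothetical guard: prefetching needs at least one worker process."""
    # With num_workers == 0 the data is loaded in the main process, so there
    # is no separate worker that could prefetch batches.
    if num_workers == 0 and prefetch_factor is not None:
        raise ValueError(
            "dataloader_prefetch_factor requires dataloader_num_workers >= 1"
        )
```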


19 Jan, 2024 1 commit
- Fix wrong xpu device in DistributedType.MULTI_XPU mode (#28386) · 8db64367
  Fanli Lin authored Jan 19, 2024
```
* remove elif xpu

* remove redundant code
```
12 Jan, 2024 1 commit
- TF: purge `TFTrainer` (#28483) · 4fb3d3a0
  Joao Gante authored Jan 12, 2024
  