  1. 14 Jun, 2024 1 commit
  2. 07 Jun, 2024 1 commit
  3. 03 Jun, 2024 2 commits
  4. 31 May, 2024 1 commit
  5. 28 May, 2024 1 commit
  6. 23 May, 2024 1 commit
  7. 21 May, 2024 2 commits
  8. 20 May, 2024 1 commit
  9. 13 May, 2024 1 commit
    • CI: update to ROCm 6.0.2 and test MI300 (#30266) · 37bba2a3
      fxmarty authored
      * update to ROCm 6.0.2 and test MI300
      
      * add callers for mi300
      
      * update dockerfile
      
      * fix trainer tests
      
      * remove apex
      
      * style
      
      * Update tests/trainer/test_trainer_seq2seq.py
      
      * Update tests/trainer/test_trainer_seq2seq.py
      
      * Update tests/trainer/test_trainer_seq2seq.py
      
      * Update tests/trainer/test_trainer_seq2seq.py
      
      * update to torch 2.3
      
      * add workflow dispatch target
      
      * we may need branches: mi300-ci after all
      
      * nit
      
      * fix docker build
      
      * nit
      
      * add check runner
      
      * remove docker-gpu
      
      * fix issues
      
      * fix
      
      ---------
      Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com>
      Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
  10. 06 May, 2024 1 commit
    • Trainer - add cache clearing and the option for batched eval metrics computation (#28769) · df475bf8
      Nate Cibik authored
      * Added cache clearing for GPU efficiency.
      
      * Added cache clearing for GPU efficiency.
      
      * Added batch_eval_metrics capability
      
      * Ran make fixup
      
      * Fixed bug
      
      * Fixed whitespace issue
      
      * Fixed outdated condition
      
      * Updated docstrings with instructions for batch_eval_metrics. Updated end of dataloader logic
      
      * Added first version of batch_eval_metrics Trainer test
      
      * Fixed batch_eval_metrics Trainer tests for both eval and predict
      
      * Fixed batch_eval_metrics behavior for new Trainer variables
      
      * Fixed batch_eval_metrics Trainer tests
      
      * Ran fixup
  11. 03 May, 2024 1 commit
  12. 02 May, 2024 1 commit
  13. 29 Apr, 2024 1 commit
  14. 25 Apr, 2024 1 commit
  15. 24 Apr, 2024 1 commit
    • Enable fp16 on CPU (#30459) · 5c57463b
      Zach Mueller authored
      * Check removing flag for torch
      
      * LLM oops
      
      * Getting there...
      
      * More discoveries
      
      * Change
      
      * Clean up and prettify
      
      * Logic check
      
      * Not
  16. 22 Apr, 2024 1 commit
  17. 18 Apr, 2024 1 commit
  18. 17 Apr, 2024 1 commit
    • Add strategy to store results in evaluation loop (#30267) · c15aad09
      Pavel Iakubovskii authored
      * Add evaluation loop container for interm. results
      
      * Add tests for EvalLoopContainer
      
      * Formatting
      
      * Fix padding_index in test and typo
      
      * Move EvalLoopContainer to pr_utils to avoid additional imports
      
      * Fix `eval_do_concat_batches` arg description
      
      * Fix EvalLoopContainer import
  19. 16 Apr, 2024 2 commits
  20. 10 Apr, 2024 1 commit
  21. 03 Apr, 2024 1 commit
  22. 27 Mar, 2024 1 commit
    • add Cambricon MLUs support (#29627) · 75769744
      huismiling authored
      * add Cambricon MLUs support
      
      * fix mlu device rng state
      
      * up for quality check
      
      * up mlu to support fp16
      
      * fix mlu device dependency error
      
      * fix mlu device dependency error
      
      * enable mlu device for bf16
      
      * fix mlu device memory tracker
  23. 19 Mar, 2024 1 commit
  24. 13 Mar, 2024 1 commit
  25. 11 Mar, 2024 1 commit
  26. 08 Mar, 2024 1 commit
  27. 06 Mar, 2024 1 commit
    • Fix TrainingArguments regression with torch <2.0.0 for dataloader_prefetch_factor (#29447) · 2890116a
      Matthew Hoffman authored
      * Fix TrainingArguments regression with torch <2.0.0 for dataloader_prefetch_factor
      
      dataloader_prefetch_factor was added to TrainingArguments in #28498 with a default value of None, but torch versions below 2.0.0 do not accept None and raise an error when num_workers == 0 and prefetch_factor != 2
      
      * Add is_torch_available() check
      
      * Use is_torch_greater_or_equal_than_2_0
      
      add back check for dataloader_prefetch_factor
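The regression described in #29447 can be illustrated with a minimal sketch. It assumes the fix amounts to choosing a version-dependent default for the prefetch factor; the helper names `parse_major_minor` and `default_prefetch_factor` are hypothetical and are not taken from the library, which uses its own `is_torch_available()` and `is_torch_greater_or_equal_than_2_0` checks.

```python
from typing import Optional, Tuple


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as '1.13.1+cu117'."""
    core = version.split("+")[0]          # drop a local build suffix like '+cu117'
    major, minor = core.split(".")[:2]    # keep only the first two components
    return int(major), int(minor)


def default_prefetch_factor(torch_version: Optional[str]) -> Optional[int]:
    """Hypothetical helper: pick a prefetch_factor default that is safe
    with num_workers == 0 on the given torch version.

    - torch missing (None): nothing to configure, so return None.
    - torch >= 2.0: None is accepted, so return None.
    - torch < 2.0: None is rejected, and any value other than 2 raises
      when num_workers == 0, so return 2.
    """
    if torch_version is None:
        return None
    if parse_major_minor(torch_version) >= (2, 0):
        return None
    return 2


print(default_prefetch_factor("1.13.1+cu117"))  # 2
print(default_prefetch_factor("2.3.0"))         # None
```

Comparing parsed integer tuples rather than raw strings avoids misordering versions such as "1.13" and "1.9".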
  28. 01 Mar, 2024 1 commit
    • Fix deprecated arg issue (#29372) · 1a7c117d
      Zach Mueller authored
      * Fix deprecated arg issue
      
      * Trainer check too
      
      * Check for dict or dataclass
      
      * Simplify, make config always AcceleratorConfig
      
      * Upstream to Trainer
  29. 20 Feb, 2024 1 commit
  30. 14 Feb, 2024 2 commits
  31. 09 Feb, 2024 1 commit
  32. 07 Feb, 2024 1 commit
  33. 05 Feb, 2024 1 commit
  34. 23 Jan, 2024 1 commit
  35. 19 Jan, 2024 1 commit
  36. 12 Jan, 2024 1 commit