Unverified commit 8692aa88 authored by Peng Wei, committed by GitHub



fixed the issue in the DPO trainer when using one node with multiple GPUs and setting device_map='auto' (#29695)

* fixed the issue in the DPO trainer when using one node with multiple GPUs

* add the assert before updating

* run the ruff formatter

* Update src/transformers/trainer.py

Thank you.
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* remember to do make style and make quality before commit

* Update src/transformers/trainer.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

---------
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
parent 243d0de9
@@ -2124,6 +2124,10 @@ class Trainer:
                     # if loss is nan or inf simply add the average of previous logged losses
                     tr_loss += tr_loss / (1 + self.state.global_step - self._globalstep_last_logged)
                 else:
+                    if tr_loss.device != tr_loss_step.device:
+                        raise ValueError(
+                            f"Calculated loss must be on the original device: {tr_loss.device} but device in use is {tr_loss_step.device}"
+                        )
                     tr_loss += tr_loss_step
                 self.current_flos += float(self.floating_point_ops(inputs))
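The guard above fails fast when, with `device_map='auto'`, a per-step loss lands on a different GPU than the running total, instead of letting the in-place add produce a cryptic device-mismatch error. A minimal sketch of the same logic, using a hypothetical `FakeTensor` stand-in (not the real `torch.Tensor`) so it runs without a GPU:

```python
class FakeTensor:
    """Hypothetical stand-in for a torch.Tensor with a .device attribute."""

    def __init__(self, value, device):
        self.value = value
        self.device = device

    def __iadd__(self, other):
        self.value += other.value
        return self


def accumulate_loss(tr_loss, tr_loss_step):
    """Mirror of the patch: refuse to accumulate across devices."""
    if tr_loss.device != tr_loss_step.device:
        raise ValueError(
            f"Calculated loss must be on the original device: {tr_loss.device} "
            f"but device in use is {tr_loss_step.device}"
        )
    tr_loss += tr_loss_step
    return tr_loss


# Same device: accumulation succeeds.
total = accumulate_loss(FakeTensor(1.0, "cuda:0"), FakeTensor(0.5, "cuda:0"))
print(total.value)  # 1.5

# Mismatched devices: raises ValueError instead of a confusing runtime error.
try:
    accumulate_loss(FakeTensor(1.0, "cuda:0"), FakeTensor(0.5, "cuda:1"))
except ValueError as e:
    print("rejected:", e)
```

In the real trainer the caller is expected to move `tr_loss_step` to `tr_loss.device` (or configure device placement correctly) before accumulating; the check only makes the failure explicit.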