Commits · 0edf30b87159e82048b5f248e4b379aebb8f364a · OpenDAS / TransformerEngine

07 Jun, 2024 1 commit

[PyTorch] Distributed intermediate/activation tensors for FSDP (#687) · 0edf30b8

Alp Dener authored Jun 07, 2024



* New TE wrapper for PyTorch FullyShardedDataParallel to make TE modules distribute their activations after the forward pass and gather them before the backward pass
Signed-off-by: Alp Dener <adener@nvidia.com>

* simplified TE module setup for FSDP comms
Signed-off-by: Alp Dener <adener@nvidia.com>

* FSDP scatter/gather for tensors saved into autograd ctx now working for base TE modules
Signed-off-by: Alp Dener <adener@nvidia.com>

* make sure activation recompute disables FSDP scatter/gather
Signed-off-by: Alp Dener <adener@nvidia.com>

* make sure FP8 weight buffers are sharded at the end of the backward pass and gathered before forward
Signed-off-by: Alp Dener <adener@nvidia.com>

* Fixed typo in attribute name
Signed-off-by: Alp Dener <adener@nvidia.com>

* fixed bug in finding FSDP-wrapped TE modules
Signed-off-by: Alp Dener <adener@nvidia.com>

* fixed typo in fp8 weight tensor name
Signed-off-by: Alp Dener <adener@nvidia.com>

* fixed incorrect # of gradients
Signed-off-by: Alp Dener <adener@nvidia.com>

* Added fp8 amax gradient hook tensor to the parameter reset
Signed-off-by: Alp Dener <adener@nvidia.com>

* get rid of erroneous dummy tensor leftover from incorrect rebase
Signed-off-by: Alp Dener <adener@nvidia.com>

* Linting fixes
Signed-off-by: Alp Dener <adener@nvidia.com>

* fixing git snafu and removing debug statements
Signed-off-by: Alp Dener <adener@nvidia.com>

---------
Signed-off-by: Alp Dener <adener@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
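
The commit message above describes scattering activations saved for backward after the forward pass and gathering them again before the backward pass. The following is a minimal single-process sketch of that idea using PyTorch's saved-tensor hooks. It is not the code from this commit: `WORLD_SIZE`, `shard`, `gather`, and `run` are hypothetical names, and the "ranks" are simulated by keeping every shard in one process, so it saves no memory. A real implementation would keep only the local shard per rank and use collective communication over the FSDP process group.

```python
import torch

WORLD_SIZE = 4  # hypothetical number of FSDP ranks, simulated in one process


def shard(tensor, world_size=WORLD_SIZE):
    """Pack hook: flatten, zero-pad to a multiple of world_size, split into shards."""
    flat = tensor.detach().reshape(-1)
    pad = (-flat.numel()) % world_size
    if pad:
        flat = torch.cat([flat, flat.new_zeros(pad)])
    # In a real distributed run each rank would keep only one of these shards.
    return list(flat.chunk(world_size)), tensor.shape


def gather(packed):
    """Unpack hook: concatenate the shards, drop the padding, restore the shape."""
    shards, shape = packed
    flat = torch.cat(shards)
    return flat[: shape.numel()].reshape(shape)


def run(use_hooks):
    """Compute gradients of a tiny model, optionally sharding the saved tensors."""
    torch.manual_seed(0)
    x = torch.randn(3, 5, requires_grad=True)
    w = torch.randn(5, 2, requires_grad=True)
    if use_hooks:
        # Tensors saved for backward inside this context go through shard();
        # gather() runs when backward needs them again.
        with torch.autograd.graph.saved_tensors_hooks(shard, gather):
            loss = torch.tanh(x @ w).sum()
    else:
        loss = torch.tanh(x @ w).sum()
    loss.backward()
    return x.grad, w.grad
```

Calling `run(True)` and `run(False)` should give matching gradients, since the shard/gather round trip restores each saved tensor exactly.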


19 Jan, 2024 1 commit

chore: Fix multiple typos (#617) · e4f506a0

hugo-syn authored Jan 19, 2024

Signed-off-by: hugo-syn <hugo.vincent@synacktiv.com>

17 Jan, 2024 1 commit

[PyTorch] Deferred Initialization via `device='meta'` option (#596) · 434d58fa

Alp Dener authored Jan 17, 2024



* Implemented deferred initialization via `device='meta'` option for te.Linear and added new PyTorch example to demonstrate its use with FullyShardedDataParallel execution.
Signed-off-by: Alp Dener <adener@nvidia.com>

* correcting Float8Tensor initialization and fixing linting errors
Signed-off-by: Alp Dener <adener@nvidia.com>

* removed duplicate code from upstream rebase, local tests passing
Signed-off-by: Alp Dener <adener@nvidia.com>

* improved comments/documentation for FSDP example
Signed-off-by: Alp Dener <adener@nvidia.com>

* converted reset_parameters() into a base module function
Signed-off-by: Alp Dener <adener@nvidia.com>

* fixed Float8Tensor creation with deferred init, all tests passing locally
Signed-off-by: Alp Dener <adener@nvidia.com>

* extended deferred initialization to all TE modules
Signed-off-by: Alp Dener <adener@nvidia.com>

* fixed linting errors
Signed-off-by: Alp Dener <adener@nvidia.com>

* removed unnecessary reference to the parent module of parameter, added clarifying comments in parameter reset
Signed-off-by: Alp Dener <adener@nvidia.com>

---------
Signed-off-by: Alp Dener <adener@nvidia.com>
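
The deferred initialization described in this commit follows the general PyTorch `device='meta'` pattern: build the module without allocating parameter storage, then materialize and initialize it later. The sketch below illustrates that pattern with `torch.nn.Linear` as a stand-in for `te.Linear`, so it runs without a GPU or Transformer Engine. The helper name `build_deferred` is hypothetical, and the materialization is done by hand here, whereas in the FSDP example the wrapper is expected to take care of that step.

```python
import torch
from torch import nn


def build_deferred(in_features, out_features, device="cpu"):
    """Create a layer on the meta device, then materialize and initialize it."""
    # Stand-in for te.Linear: on the meta device the parameters carry
    # shape and dtype only, with no backing storage.
    layer = nn.Linear(in_features, out_features, device="meta")
    assert layer.weight.is_meta

    # Allocate uninitialized storage on the target device ...
    layer = layer.to_empty(device=device)
    # ... and only now fill in the parameter values.
    layer.reset_parameters()
    return layer


layer = build_deferred(8, 4)
out = layer(torch.ones(2, 8))
```

Keeping the initialization logic in `reset_parameters()` is what lets the two steps be separated: the constructor only records shapes, and the values are produced once real storage exists.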
